I have two scenarios of measuring metrics like computation time and parallel speedup (sequential_time/parallel_time).

Scenario 1:

Sequential time measurement:

```c
startTime = omp_get_wtime();
// for loop computation
endTime = omp_get_wtime();
seq_time = endTime - startTime;
```

Parallel time measurement:

```c
startTime = omp_get_wtime();
#pragma omp parallel for reduction(+:pi) private(i)
for (blah blah) {
    // computation
}
endTime = omp_get_wtime();
paralleltime = endTime - startTime;
speedup = seq_time / paralleltime;
```

Scenario 2:

Sequential time measurement:

```c
for (blah blah) {
    startTime = omp_get_wtime();
    // computation
    endTime = omp_get_wtime();
    seq_time += endTime - startTime;
}
```

Parallel time measurement:

```c
#pragma omp parallel for reduction(+:pi,paralleltime) private(i,startTime,endTime)
for (blah blah) {
    startTime = omp_get_wtime();
    // computation
    endTime = omp_get_wtime();
    paralleltime = endTime - startTime;
}
speedup = seq_time / paralleltime;
```

I know that Scenario 2 is NOT the best production code, but I think it measures the actual theoretical performance by OVERLOOKING the overhead involved in OpenMP spawning and managing (thread context switching) several threads. So it should give us a linear speedup.
But Scenario 1 considers the overhead involved in spawning and managing threads. My doubt is this: with Scenario 1, I am getting a speedup which starts out linear but tapers off as we move to a higher number of iterations. With Scenario 2, I am getting a fully linear speedup irrespective of the number of iterations.

I was told that, in reality, Scenario 1 will give me a linear speedup irrespective of the number of iterations. But I think it will not, because of the high overhead due to thread management. Can someone please explain to me why I am wrong?
Thanks! And sorry about the rather long post.

Tags: performance, parallel-processing, parallel, parallel-programming, openmp. Asked Oct 28 '11 by Neo.
Parallel time measurement in Scenario 2 is incorrect, because at each iteration you overwrite the result obtained by the previous iteration on the same thread. If you change it to `paralleltime += endTime - startTime;` (notice the `+=`), you will sum up the time of all iterations, which will be basically the same as seq_time. – Alexey Kukanov Oct 28 '11 at 12:27

That is why I am performing a reduce operation on both pi and paralleltime: `#pragma omp parallel for reduction(+:pi, paralleltime) private(i,startTime,endTime)` – Neo Oct 28 '11 at 17:28

Reduction of paralleltime does not make your program correct. You may check the OpenMP specification, particularly section 2.9.3.6, for details. In short, each private copy of a reduction variable accumulates the result of the several loop iterations executed by the same thread. To illustrate, assume the 1st and 2nd loop iterations are executed by the same thread; then the 2nd iteration will write its time over that of the 1st iteration, which is simply lost. – Alexey Kukanov Oct 28 '11 at 18:50

Thanks Alexey! I did not know that reduction works this way. – Neo Oct 28 '11 at 22:48
There are many situations where Scenario 2 won't give you linear speedup either: false sharing between threads (or, for that matter, true sharing of shared variables which get modified), memory bandwidth contention, etc. The sub-linear speedup is generally real, not a measurement artifact.

More generally, once you get to the point where you're putting timers inside for loops, you're considering more fine-grained timing information than is really appropriate to measure using timers like this. You might well want to disentangle the thread-management overhead from the actual work being done, for a variety of reasons, but here you're trying to do that by inserting N extra function calls to omp_get_wtime(), plus the arithmetic and the reduction operation, all of which have non-negligible overhead of their own.

If you really want accurate timing of how much time is being spent in the computation line, you want to use something like sampling rather than manual instrumentation (we talk a little bit about the distinction here). Using gprof or Scalasca or OpenSpeedShop (all free software) or Intel's VTune (a commercial package) will give you the information about how much time is being spent on that line, often even broken down by thread, with much lower overhead.
First of all, by the definition of speedup, you should use Scenario 1, which includes the parallel overhead. Also, in Scenario 2 the measurement of paralleltime is wrong. To satisfy your goal in Scenario 2, you would need a per-thread paralleltime: allocate an array such as `double paralleltime[NUM_THREADS]` and index it by `omp_get_thread_num()` (note that this code will have false sharing, so you'd better allocate a 64-byte padded struct per thread). Then measure the per-thread computation time, and finally take the longest one to calculate a different kind of speedup (I'd call it a sort of parallelism instead).

No, you may see sub-linear speedup even in Scenario 2, and even super-linear speedup can be obtained. The potential reasons (i.e., excluding parallel overhead) are:

- Load imbalance: the workload length of the computation differs from iteration to iteration. This is the most common reason for low speedup (but you're saying load imbalance is not the case here).
- Synchronization cost: if there is any kind of synchronization (e.g., mutex, event, barrier), you may have waiting/blocking time.
- Cache and memory cost: when the computation requires large bandwidth or a large working set, parallel code may suffer from bandwidth limitation (though it's rare in reality) and cache conflicts. False sharing can also be a significant reason, but it's easy to avoid. A super-linear effect can also be observed, because using multiple cores gives you more total cache (i.e., the private L1/L2 caches of each core).

Scenario 1 additionally includes the overhead of the parallel library:

- Forking/joining threads: although most parallel library implementations won't do physical thread creation/termination on every parallel construct.
- Dispatching/joining logical tasks: even if physical threads are already created, you need to dispatch logical tasks to the threads (in general M tasks to N threads), and also do a sort of join operation at the end (e.g., an implicit barrier).
- Scheduling overhead: for static scheduling (as in your code, which uses OpenMP's default static scheduling), the overhead is minimal. You can safely ignore it when the workload is sufficiently large (say, 0.1 second). Dynamic scheduling (such as work-stealing in TBB) has more overhead, but it too is insignificant once your workload is large enough.

I don't think your code (a one-level, statically scheduled parallel loop) has high parallel overhead due to thread management, unless it is called millions of times per second. So the cause may be one of the other reasons mentioned above.

Keep in mind that there are many factors that determine speedup, from the inherent parallelism of the program (load imbalance and synchronization) to the overhead of the parallel library (e.g., scheduling overhead).
Regarding the paralleltime computation: I am performing a reduction on paralleltime as well in Scenario 2, so the paralleltime I get will be correct, right? – Neo Oct 28 '11 at 17:41

By the way, I love your answer; it gives a nice, detailed explanation. But I somehow felt I had to choose the other one. :) – Neo Oct 28 '11 at 17:49
The overhead due to parallelism is part of the real execution time, hence IMHO Scenario 1 is better. Besides, according to your OpenMP directives, you're doing a reduction. Scenario 1 takes the cost of that reduction into account; Scenario 2 does not. So basically, you're measuring fewer things than in Scenario 1, which probably has some impact on your measurements too.
Otherwise, Jonathan Dursi's answer is excellent.
There are several options in OpenMP for how the work is distributed among the threads. This can affect the linearity of your measurement method 1. Your measurement method 2 doesn't seem useful. What were you trying to get at with it? If you want to know single-thread performance, then run a single thread. If you want parallel performance, then you need to include the overhead.