OpenMP C parallelizing nested for loop slow?


The inputs to the program are: numParticles (loop index); timeStep (not important, the value doesn't change); numTimeSteps (loop index); and numThreads (the number of threads to be used). I've looked around the web and tried a few things (such as nowait), but nothing really changed. I'm pretty sure the parallel code is correct because I checked the outputs. Is there something I'm doing wrong here?

EDIT: Also, it seems that you can't use the reduction clause on C structures?

EDIT2: Working with gcc on Linux with a 2-core CPU. I have tried running this with values as high as numParticles = 40 and numTimeSteps = 100000. Maybe I should try higher? Thanks.

Tags: for-loop, parallel, openmp

asked Aug 21 '11 at 1:44 by Jason Lee, edited Aug 21 '11 at 3:07
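On the reduction question: one common workaround, sketched here under assumptions, is to reduce into plain scalar locals and assemble the struct after the loop. The Vec2 type and sum_vectors function are hypothetical names made up for illustration, since the question's own struct is not shown. Newer OpenMP versions (4.0 and later) also offer declare reduction for user-defined types, which this sketch does not use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D vector type; the struct from the question is not shown. */
typedef struct { double x, y; } Vec2;

/* Sum an array of vectors. The reduction clause is applied to two scalar
   locals, and the struct is built from them once the loop has finished. */
Vec2 sum_vectors(const Vec2 *v, int n)
{
    assert(v != NULL || n == 0);

    double sx = 0.0, sy = 0.0;
    int i;

    /* Each thread gets private copies of sx and sy, combined with + at the end. */
    #pragma omp parallel for reduction(+:sx, sy)
    for (i = 0; i < n; i++) {
        sx += v[i].x;
        sy += v[i].y;
    }

    Vec2 total = { sx, sy };
    return total;
}
```

If the code is compiled without -fopenmp, the pragma is ignored and the loop simply runs serially with the same result.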

– Guy Sirton Aug 21 '11 at 2:51

Using gcc on Linux. numParticles is 40 and numTimeSteps is 100000. The CPU has 2 cores. I tried higher values for those two variables, but the result is still the same. – Jason Lee Aug 21 '11 at 3:03

numTimeSteps should have no impact since it is outside the parallel region. Are you enabling OpenMP? (I think it's -fopenmp in gcc.) – Guy Sirton Aug 21 '11 at 3:22

Yep, it is enabled. – Jason Lee Aug 21 '11 at 3:29

OK. What happens with numParticles = 200, numTimeSteps = 100? – Guy Sirton Aug 21 '11 at 3:47

It is possible that your loops are too small. There is overhead associated with creating threads to process portions of the loop, so if the loop is too small, the parallelized version may run slower. Another consideration is the number of cores available.

Your second omp directive is less likely to be useful because that loop does far fewer calculations. I would suggest removing it.

EDIT: I tested your code with numParticles = 1000 and two threads. It ran in 30 seconds; the single-threaded version ran in 57 seconds. Even with numParticles = 40 I see a significant speedup. This is with Visual Studio 2010.
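If small loops turn out to be the problem, one option not mentioned above is OpenMP's if clause, which makes the loop run serially when the trip count is below a cutoff. The following is only a sketch: the scale function and the MIN_PARALLEL_N cutoff of 1000 are hypothetical, and a sensible cutoff would have to be found by measuring on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff below which threading is assumed not to pay off.
   The value is a placeholder, not a measured number. */
enum { MIN_PARALLEL_N = 1000 };

/* Multiply every element of a by factor. The if clause tells OpenMP to
   run the loop with a single thread when n is below the cutoff. */
void scale(double *a, int n, double factor)
{
    assert(a != NULL || n == 0);

    int i;
    #pragma omp parallel for if(n >= MIN_PARALLEL_N)
    for (i = 0; i < n; i++)
        a[i] *= factor;
}
```

The result is the same either way; only the decision to use multiple threads changes with n.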

– Jason Lee Aug 21 '11 at 3:05

Number of particles. – Guy Sirton Aug 21 '11 at 3:15

Oh right, good to know that it works at least. I'm running this on Linux with the gcc compiler; does this make any difference? – Jason Lee Aug 21 '11 at 3:25

It does. The overhead of creating a thread may be larger. The OpenMP implementation may be slightly different. Try a larger number of particles (e.g. 500) and see what happens. Make sure you've enabled OpenMP support on the gcc command line. – Guy Sirton Aug 21 '11 at 3:29

I can think of two possible sources of the slowdown: (a) the compiler made some optimizations (vectorization being the first candidate) in the sequential version but not in the OpenMP version, and (b) thread management overhead. Both are easy to check if you also run the OpenMP version with a single thread (i.e. set numThreads to 1). If it is much slower than the sequential version, then (a) is the most likely reason; if it is similar to the sequential version and faster than the same code with 2 threads, the most likely reason is (b).

In the latter case, you may restructure the OpenMP code for less overhead. First, having two parallel regions (#pragma omp parallel) inside a loop is not necessary; you can have a single parallel region and two parallel loops inside it:

for (t = 0; t

This way, you ensure that the same set of threads runs through the whole computation, no matter what OpenMP implementation is used.
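As a rough illustration of the single-parallel-region structure described here, the sketch below opens one parallel region around the timestep loop and uses two omp for worksharing loops inside it. It is not the code from the question or from this answer: the Particle type, the simulate function, and the toy force formula are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1-D particle state; field names are illustrative only. */
typedef struct { double pos, vel, force; } Particle;

/* Advance the particles by numTimeSteps steps of size timeStep.
   One parallel region encloses the whole timestep loop, so the same
   team of threads is reused for every iteration. */
void simulate(Particle *p, int numParticles, int numTimeSteps, double timeStep)
{
    assert(p != NULL || numParticles == 0);

    #pragma omp parallel
    {
        int t, i, j;   /* declared inside the region, so private per thread */

        /* Every thread runs the t loop; the work inside is shared out. */
        for (t = 0; t < numTimeSteps; t++) {

            /* Loop 1: accumulate a toy pairwise force on each particle. */
            #pragma omp for
            for (i = 0; i < numParticles; i++) {
                double f = 0.0;
                for (j = 0; j < numParticles; j++)
                    if (j != i)
                        f += p[j].pos - p[i].pos;
                p[i].force = f;
            }
            /* Implicit barrier: all forces are written before loop 2 starts. */

            /* Loop 2: update velocity and position from the stored force. */
            #pragma omp for
            for (i = 0; i < numParticles; i++) {
                p[i].vel += p[i].force * timeStep;
                p[i].pos += p[i].vel * timeStep;
            }
            /* Implicit barrier: step t is complete before step t + 1 begins. */
        }
    }
}
```

The implicit barrier at the end of each omp for is what keeps the timesteps in order, so neither loop should be given nowait in this arrangement.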

I was also thinking about the optimization... perhaps auto-vectorization. But you'd think the compiler would do the same for a sub-loop as well? – Guy Sirton Aug 21 '11 at 17:07

I'm a bit confused about why the parallel region can be started before the timestep loop. The results of later iterations of the timestep loop depend on its first iteration, so wouldn't it be wrong to do that? And I have tried comparing the times with only 1 thread for the OpenMP version: for small numParticles the execution times are the same, but as I increase it, the OpenMP version gets slower and slower. With more than 1 thread, the OpenMP version is noticeably slower even for smaller numParticles. – Jason Lee Aug 22 '11 at 1:22

Each timestepping iteration will be synchronized across all threads, because #pragma omp for has an implicit barrier at the end. So a new iteration cannot start before all threads are done with the previous iteration. – Alexey Kukanov Aug 22 '11 at 10:06

@Guy Sirton: Yes, it's reasonable to expect that the compiler will be able to apply the same optimizations to the inner loop. This makes the hypothesis of compiler optimization impact less likely, which I think is also confirmed by Jason's observations. Thread management overhead seems the most likely reason for the slowdown. – Alexey Kukanov Aug 22 '11 at 10:09
