OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)?



I'm writing a matrix multiplication program with OpenMP that, for cache efficiency, implements the multiplication A x B(transpose) rows x rows instead of the classic A x B rows x columns. Doing this I ran into an interesting fact that seems illogical to me: if I parallelize the outermost loop, the program is slower than if I put the OpenMP directive on the innermost loop; on my computer the times are 10.9 vs 8.1 seconds.

    // A and B are double* allocated with malloc, Nu is the length of the
    // matrices, which are square
    //#pragma omp parallel for            // outermost: 10.9 s
    for (i = 0; i < Nu; i++)
        for (j = 0; j < Nu; j++) {
            #pragma omp parallel for      // innermost: 8.1 s
            for (k = 0; k < Nu; k++)
                C[i*Nu + j] += A[i*Nu + k] * B[j*Nu + k];
        }

By tweaking omp parameters I got a 200% speed-up on my machine. Original: llcomp.googlecode.com/hg/examples/mxm.c; current: codepad.org/nSfZHp03 – J.F. Sebastian Jan 18 '11 at 22:01

Nice solution. Yeah, OpenMP is kinda tricky. – Elalfer Jan 18 '11 at 22:04

Code that uses 'fortran' memory layout for the B matrix runs 4-8x faster (the greatest benefit) for 1000x1000 matrices (the threaded version takes 0.5 seconds): gist.github.com/790865 – J.F. Sebastian Jan 22 '11 at 4:52
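The layout trick J.F. Sebastian describes can be sketched as follows. This is a minimal illustration of the idea, not his gist: it packs B's columns into a transposed copy so the inner dot product walks both operands contiguously (the names matmul_bt and Bt are mine):

    #include <stdlib.h>

    /* C = A * B for Nu x Nu row-major matrices, reading B through a
     * column-major (transposed) copy so the k loop is stride-1 on
     * both operands. */
    static void matmul_bt(double *C, const double *A, const double *B, int Nu)
    {
        double *Bt = malloc((size_t)Nu * Nu * sizeof *Bt);
        for (int j = 0; j < Nu; j++)        /* pack: Bt[j][k] = B[k][j] */
            for (int k = 0; k < Nu; k++)
                Bt[j*Nu + k] = B[k*Nu + j];

        #pragma omp parallel for            /* ignored if built without OpenMP */
        for (int i = 0; i < Nu; i++)
            for (int j = 0; j < Nu; j++) {
                double sum = 0.0;
                for (int k = 0; k < Nu; k++)
                    sum += A[i*Nu + k] * Bt[j*Nu + k];
                C[i*Nu + j] = sum;
            }
        free(Bt);
    }

The one-time O(Nu^2) packing cost is quickly repaid by the O(Nu^3) multiply reading both arrays sequentially.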

Try hitting the result array less often. Writing C[i*Nu + j] on every iteration of the k loop induces cache-line sharing between cores and prevents the operation from running in parallel. Using a local accumulator instead allows most of the writes to take place in each core's L1 cache.

Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B. Try:

    void mult(double *restrict C, const double *restrict A,
              const double *restrict B, int Nu)
    {
        for (int i = 0; i < Nu; i++) {
            const double *Arow = A + i*Nu;
            double *Crow = C + i*Nu;
            for (int j = 0; j < Nu; j++) {
                const double *Bcol = B + j*Nu;  /* j, not i; see J.F. Sebastian's comment */
                double sum = 0.0;
                for (int k = 0; k < Nu; k++)
                    sum += Arow[k] * Bcol[k];
                Crow[j] = sum;
            }
        }
    }

Thanks for the answer, I'll try, then I'll come back. – RR576 Jan 18 '11 at 19:03

Incredible, the time has become only 4.2 s with the pragma on the innermost loop and 4.4 s on the outermost (!), while with the #pragma as in the code you posted the time is >17 s, I don't know why. Thanks really to all, even if I don't understand why the outermost is slightly slower than the innermost. – RR576 Jan 18 '11 at 19:25

@RR576: Check the results, you may not have the right output when parallelizing the innermost loop without specifying a reduction operation. – Ben Voigt Jan 18 '11 at 19:35

Yes, you are right on this too.

I made several errors in the program, and your guess is correct: with the reduction, the innermost version works (4.2 s), but the outermost is the most efficient (3.9 s!), while the middle loop is very slow, around 20 s. I think this is due to the cache line (the written address varies with i very fast), so the apparent paradox is resolved. Tomorrow morning I have the exam on scientific programming... thanks again to you and Elalfer. – RR576 Jan 18 '11 at 19:59

There is a typo in Bcol = B + i*Nu; it should be j. – J.F. Sebastian Jan 22 '11 at 4:34
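The resolution described above, parallelizing the outermost loop while accumulating into a local variable, can be sketched like this (a minimal reconstruction, not RR576's actual program):

    /* C = A * B(transpose), all Nu x Nu row-major:
     * C[i][j] = dot(row i of A, row j of B).
     * The thread team is forked once, and each thread owns whole rows
     * of C, so there is no sharing on the writes. */
    static void mmul_rows(double *C, const double *A, const double *B, int Nu)
    {
        #pragma omp parallel for            /* no-op if built without OpenMP */
        for (int i = 0; i < Nu; i++)
            for (int j = 0; j < Nu; j++) {
                double sum = 0.0;           /* local accumulator, lives in L1 */
                for (int k = 0; k < Nu; k++)
                    sum += A[i*Nu + k] * B[j*Nu + k];
                C[i*Nu + j] = sum;
            }
    }

Each iteration of the i loop is independent, so no reduction clause is needed here; the races only appear when the pragma moves to a loop whose iterations share an accumulator.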

You could have some dependencies in the data when you parallelize the outer loop that the compiler is not able to rule out, so it adds additional locks. Most probably it decides that different outer-loop iterations could write into the same (C+(i*Nu+j)), and it adds access locks to protect it. The compiler can probably figure out that there are no dependencies if you parallelize the 2nd loop.

But figuring out that there are no dependencies when parallelizing the outer loop is not so trivial for a compiler.

UPDATE: Some performance measurements. It looks like 1000 double multiplies and adds are not enough to cover the cost of thread synchronization.

I've done a few small tests, and simple vector-scalar multiplication is not effective with OpenMP unless the number of elements is more than ~10'000. Basically, the larger your array is, the more performance you will get from using OpenMP. So by parallelizing the innermost loop you have to split the task between threads and gather the data back 1'000'000 times.
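The overhead argument can be made concrete: with the pragma on the innermost loop, the runtime enters (and synchronizes at the end of) a parallel region once per (i, j) pair, i.e. Nu*Nu = 1'000'000 times for Nu = 1000, versus exactly once when the pragma sits on the outer loop. A sketch of the two placements (my own illustration, with the function names invented here):

    enum { NU = 16 };

    /* Pragma on the innermost loop: one parallel region (fork plus
     * implicit barrier) per output element -- NU*NU regions total. */
    static void mmul_inner(double *C, const double *A, const double *B, int Nu)
    {
        for (int i = 0; i < Nu; i++)
            for (int j = 0; j < Nu; j++) {
                double sum = 0.0;
                #pragma omp parallel for reduction(+:sum)
                for (int k = 0; k < Nu; k++)
                    sum += A[i*Nu + k] * B[j*Nu + k];
                C[i*Nu + j] = sum;
            }
    }

    /* Pragma on the outermost loop: a single parallel region for the
     * whole multiplication. */
    static void mmul_outer(double *C, const double *A, const double *B, int Nu)
    {
        #pragma omp parallel for
        for (int i = 0; i < Nu; i++)
            for (int j = 0; j < Nu; j++) {
                double sum = 0.0;
                for (int k = 0; k < Nu; k++)
                    sum += A[i*Nu + k] * B[j*Nu + k];
                C[i*Nu + j] = sum;
            }
    }

Both compute the same C; only the number of fork/join points differs, which is where the measured gap comes from.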

PS. Try Intel ICC; it is kinda free to use for students and open-source projects. I remember OpenMP not being worth using for arrays smaller than 10'000 elements.

UPDATE 2: Reduction example:

    double sum = 0.0;
    double *al = A + i*Nu;
    double *bl = B + j*Nu;   /* B, not A */
    #pragma omp parallel for shared(al, bl) reduction(+:sum)
    for (int k = 0; k < Nu; k++)
        sum += al[k] * bl[k];
    C[i*Nu + j] = sum;

The loop has no carried dependency; all iterations are independent. – RR576 Jan 18 '11 at 17:04

You can see it, but the compiler is not an AI and could miss it ;) I've actually had a lot of battles with OpenMP & icc over this stuff. – Elalfer Jan 18 '11 at 17:05

Sorry for my arrogance, you are surely more expert than me, I'll check. If I parallelize the second loop the result is more than 15 seconds.

– RR576 Jan 18 '11 at 17:15

One note: have you tried using the reduction clause for the innermost loop? I'll try this code later; it looks like fun to remember how to work with OpenMP.

Which compiler are you using? Gcc or icc? And what is the size of your matrix?

– Elalfer Jan 18 '11 at 17:18

If you want, I can send you my code. The matrix is big (1000x1000). I see no room to use reduction (C is a pointer), and in the outermost loop you can't use it in any case (what would you reduce over?). The problem is that I'm not a computer engineer; I don't know how computer memory "works" physically. I know how to use the cache line, and as you can see I used that information for the rows-x-rows multiplication, but my knowledge stops there. For me the problem is in the use of the cache; only that can account for all this extra execution time.

Thanks for the answer, I'm looking forward to your ideas. – RR576 Jan 18 '11 at 17:40

