You're not synchronizing the summing into the blockDim.x location properly. None of the threads wait to see what the others have written before adding their own value.

Sort of like:
- Everyone reads zero, goes home, and calculates zero + numer.
- Everyone writes zero + numer to the memory location.

The highest thread ID wins because it has a high likelihood of acting last, I suppose.

What you want to do instead, in order to do a quick sum, is a binary sum on s_shared[threadIdx.x]:
- Everyone writes their numer.
- Half the threads calculate sums of pairs and write those to a new location.
- A quarter of the threads calculate the sums of pairs of pairs, and write those to a new location.
- And so on, until you have just one thread and one sum.

This takes O(n) work and O(log n) time.
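The binary sum described above might look roughly like the sketch below. It is not the asker's kernel: the names blockSum, sumOnDevice and BLOCK_SIZE are made up for illustration, the block size is assumed to be a power of two, and error checking is left out. It also adds pairs in place rather than writing to a new location, which is safe here because in each round the threads that write (tid < stride) only read from slots that nobody writes in that round.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumption: threads per block, must be a power of two

// Each block sums up to BLOCK_SIZE elements of `in` into out[blockIdx.x].
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float s_shared[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Step 1: everyone writes their own value (zero past the end of the input).
    s_shared[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // wait until every value is visible to the whole block

    // Step 2: half the threads add pairs, then a quarter, and so on.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            s_shared[tid] += s_shared[tid + stride];
        }
        __syncthreads();  // kept outside the if so every thread reaches it
    }

    // Step 3: one thread is left holding the block's sum.
    if (tid == 0) {
        out[blockIdx.x] = s_shared[0];
    }
}

// Hypothetical host helper: sums n floats (n <= BLOCK_SIZE) using one block.
float sumOnDevice(const float *h_in, int n)
{
    float *d_in = nullptr, *d_out = nullptr, result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<1, BLOCK_SIZE>>>(d_in, d_out, n);

    // This copy runs after the kernel on the default stream, so it sees the result.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The __syncthreads() after each round is the piece missing from the unsynchronized version: it is what makes each thread wait for the others' writes before reading them. For inputs larger than one block you would launch several blocks and then sum the per-block results in out, for example with a second launch of the same kernel.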
Just to make a note of this, the logic here is known as a reduction. There are a few samples of this in the CUDA SDK. See: cuda-sdk/C/src/reduction/reduction_kernel.cu – sharth Mar 5 '10 at 19:08
I can't really give you an answer, but what I can give you is a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.