I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS). My kernel looks like the following:

    __global__ void krnlSpMVmul1(float *data_mat, int num_nonzeroes,
                                 unsigned int *row_ptr, float *data_vec,
                                 float *data_result)
    {
        extern __shared__ float local_result[];
        local_result[threadIdx.x] = 0;

        float vector_elem = data_vec[blockIdx.x];
        unsigned int start_index = row_ptr[blockIdx.x];
        unsigned int end_index   = row_ptr[blockIdx.x + 1];

        for (int index = (start_index + threadIdx.x);
             (index < end_index) && (index < num_nonzeroes);
             index += blockDim.x)
            local_result[threadIdx.x] += data_mat[index] * vector_elem;

        __syncthreads();

        // ... reduction of local_result into data_result[blockIdx.x] ...
    }

and I call it like this:

    krnlSpMVmul1<<<grid, block, BLOCK_SIZE>>>(dev_data_mat, num_nonzeroes, dev_row_ptr, dev_data_vec, dev_data_result);

I hope this is straightforward, but I will explain things if it is of any interest. One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well.
Again, just the last value is not being computed.

cuda cuda-kernel – asked Feb 28 at 12:11, edited Feb 28 at 12:41 by Random-I-Am
It is very likely that the problem is in code you have omitted. Could you also show the kernel arguments you are using to call the kernel? – talonmies Feb 28 at 12:15.
Your dynamically allocated shared memory size is incorrect. Right now you are doing this:

    krnlSpMVmul1<<<grid, block, BLOCK_SIZE>>>(.....)

The shared memory size should be given in bytes. Using your 64 threads per block case, that means you are allocating enough shared memory for only 16 float-sized words, which explains why the magic 17-entries-per-row case results in failure: you have a shared buffer overflow, which will trigger a protection fault in the GPU and abort the kernel.
You should be doing something like this:

    krnlSpMVmul1<<<grid, block, BLOCK_SIZE * sizeof(float)>>>(.....)

That will give you the correct dynamic shared memory size and should eliminate the problem.
One last question: I tried to run the kernel with real data, on a matrix with thousands of rows. It seemed that all rows (apart from the ones with too many nonzeroes) were calculated correctly. How can that be, if the kernel should fail as soon as the first out-of-bounds access occurs? – Random-I-Am Feb 28 at 13:02

The answer to that probably depends on which GPU you are using (on older hardware the results might just be wrong; on a Fermi card you should get an unspecified launch failure error if you check for it correctly).
I would also recommend running your code with cuda-memcheck. It will detect and report out-of-bounds shared and global memory accesses should they occur. – talonmies Feb 28 at 13:07

Thanks a lot for your effort. It's really appreciated. (In fact I really do have a CC1.1 device running.) – Random-I-Am Feb 28 at 13:09