Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, would use local memory. Using your squaring kernel, you could stage the data in an intermediate buffer.
Remember to pass in the additional parameter:
__kernel void square( __global float *input, __global float *output, __local float *temp, const unsigned int count) { int gtid = get_global_id(0); int ltid = get_local_id(0); if (gtid.
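A complete kernel along these lines might look like the sketch below. It is only an illustration of the staging idea, not the answer's exact code: the variable name square_local_src and the helper kernel_source_length are hypothetical, and the kernel is kept as a C string because that is the form clCreateProgramWithSource accepts. It assumes one float is staged per work-item and that the barrier sits outside the bounds check so every work-item in the group reaches it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* OpenCL C source for a squaring kernel that stages its input in local memory.
   Hypothetical sketch: one float per work-item is copied into temp first. */
const char *const square_local_src =
    "__kernel void square(__global const float *input,\n"
    "                     __global float *output,\n"
    "                     __local float *temp,\n"
    "                     const unsigned int count)\n"
    "{\n"
    "    unsigned int gtid = get_global_id(0);\n"
    "    unsigned int ltid = get_local_id(0);\n"
    "    /* Stage one element per work-item into local memory. */\n"
    "    if (gtid < count)\n"
    "        temp[ltid] = input[gtid];\n"
    "    /* All work-items in the group must reach the barrier. */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    if (gtid < count)\n"
    "        output[gtid] = temp[ltid] * temp[ltid];\n"
    "}\n";

/* Length in bytes of a kernel source string, usable for the optional
   lengths argument of clCreateProgramWithSource. */
size_t kernel_source_length(const char *src)
{
    assert(src != NULL);
    return strlen(src);
}
```

For a plain squaring operation the local buffer buys nothing in speed; it only shows the mechanics of writing to local memory, synchronizing, and reading back.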
I've read through the NVIDIA introductory material, and I still find the examples too complex. I'm looking for an über-simple 1-dimensional example of using local memory to get my feet wet. – splicer Apr 2 '10 at 12:54
Thanks for adding code in your last edit! I can't seem to get your kernel working, though. How would I use clSetKernelArg() for temp? Do I need to use clCreateBuffer() for temp? Also, there are a few typos in your kernel: "temp * temp" should be "temp[ltid] * temp[ltid]", and a closing brace should be inserted before the last line. – splicer Apr 3 '10 at 22:56
Running on the CPU under Snow Leopard, I tried clSetKernelArg(kernel, 2, sizeof(cl_float), NULL); but it crashes. Any ideas? – splicer Apr 3 '10 at 23:11
I corrected the typos; serves me right for typing on an iPod. Your clSetKernelArg call is not allocating enough memory, though: you need space for one cl_float per thread (you have only allocated one float). Try: clSetKernelArg(kernel, 2, sizeof(cl_float) * local_work_size[0], NULL); where local_work_size[0] is the work-group size in dimension 0. – Tom Apr 4 '10 at 14:36
Thanks! Looks like you're missing a semicolon on line 11. On the CPU, get_local_size(0) returns 1 for me, so shouldn't my use of clSetKernelArg work? Is this a bug in Apple's implementation? – splicer Apr 6 '10 at 13:37
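To make the sizing from the comments above concrete: a __local pointer argument gets no buffer from clCreateBuffer; instead its size in bytes is passed to clSetKernelArg with a NULL value. The helper names below (local_scratch_bytes, padded_global_size) are hypothetical, and the sketch assumes one float staged per work-item and that cl_float has the same size as the host's float. The second helper reflects the OpenCL 1.x rule that, when a local size is given, the global size must be a multiple of it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of __local scratch space needed when every work-item in a
   work-group stages exactly one float (assumption of this sketch). */
size_t local_scratch_bytes(size_t local_work_size)
{
    assert(local_work_size > 0);
    return local_work_size * sizeof(float);
}

/* Smallest multiple of local_work_size that is >= count. The extra
   work-items are the ones a bounds check in the kernel filters out. */
size_t padded_global_size(size_t count, size_t local_work_size)
{
    assert(local_work_size > 0);
    return ((count + local_work_size - 1) / local_work_size) * local_work_size;
}
```

With these, the call from the comment would read clSetKernelArg(kernel, 2, local_scratch_bytes(local_work_size[0]), NULL); and the padded size would be the global size passed to clEnqueueNDRangeKernel.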
There is another way to do this if the size of the local memory is constant. Instead of passing a pointer in the kernel's parameter list, you can declare the local buffer inside the kernel itself by marking it __local: __local float localBuffer[1024]; This saves host code, because one fewer clSetKernelArg call is needed.
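A sketch of that variant follows, again with the kernel held in a C string. The names square_fixed_src, local_buffer_len and fits_fixed_local_buffer are hypothetical. The trade-off this illustrates is an assumption worth checking on the host: with a fixed 1024-element buffer and one float per work-item, the work-group size in dimension 0 must not exceed 1024.

```c
#include <assert.h>
#include <stddef.h>

/* Must match the array size written in the kernel source below. */
const size_t local_buffer_len = 1024;

/* Squaring kernel with a fixed-size local buffer declared at kernel scope,
   so no __local pointer argument (and no clSetKernelArg call) is needed. */
const char *const square_fixed_src =
    "__kernel void square(__global const float *input,\n"
    "                     __global float *output,\n"
    "                     const unsigned int count)\n"
    "{\n"
    "    __local float localBuffer[1024];\n"
    "    unsigned int gtid = get_global_id(0);\n"
    "    unsigned int ltid = get_local_id(0);\n"
    "    if (gtid < count)\n"
    "        localBuffer[ltid] = input[gtid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    if (gtid < count)\n"
    "        output[gtid] = localBuffer[ltid] * localBuffer[ltid];\n"
    "}\n";

/* Returns 1 if a work-group of this size fits the fixed buffer
   (one float per work-item), 0 otherwise. */
int fits_fixed_local_buffer(size_t local_work_size)
{
    assert(local_work_size > 0);
    return local_work_size <= local_buffer_len;
}
```

Note that count moves from argument index 3 to index 2 here, since the temp parameter is gone.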