Prefetching in Nvidia CUDA?



I'm working on data prefetching in NVIDIA CUDA. I have read some documents on prefetching on the device itself, i.e. prefetching from shared memory to cache.

But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to some documents or other material on this topic? Any help would be appreciated.

Tags: cuda, nvidia, prefetch. Asked Oct 17 '11 at 11:13 by user997704; edited Oct 18 '11 at 8:46 by Sathya.

Your question is way too broad in its current form - try asking a more specific question. You might also want to check out the nVidia developer forums at developer.nvidia.com. – Paul R Oct 17 '11 at 11:19

Ok... how can I add a prefetch instruction to a given CUDA program? – user997704 Oct 17 '11 at 17:32

This is still very vague - prefetch what to what exactly? For what purpose? On what generation of GPU? – Paul R Oct 17 '11 at 20:20

You can overlap host/device memory transfers and kernels by using page-locked (pinned) host memory. You could use this to copy working sets #1 and #2 from host to device, then process chunk #i while transferring #i+1 and loading #i+2 - all concurrently.

So you could be streaming data in and out of the GPU and computing on it all at once. Please refer to the CUDA 4.0 Programming Guide and the CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
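A minimal sketch of that overlap, using page-locked host memory and two streams; the chunk sizes, buffer names, and the `process_chunk` kernel are illustrative assumptions, not part of the original answer:

```cuda
#include <cuda_runtime.h>

__global__ void process_chunk(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // placeholder computation
}

int main() {
    const int N = 1 << 20, CHUNKS = 4, CHUNK = N / CHUNKS;
    float *h_data, *d_data;

    // Page-locked host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_data, N * sizeof(float));
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < CHUNKS; ++c) {
        cudaStream_t st = s[c % 2];
        float *h = h_data + c * CHUNK, *d = d_data + c * CHUNK;
        // Copies and the kernel for chunk c queue in one stream; they can
        // overlap with work queued for the previous chunk in the other stream.
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, st);
        process_chunk<<<(CHUNK + 255) / 256, 256, 0, st>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h_data); cudaFree(d_data);
    return 0;
}
```

Whether copies and kernels actually overlap depends on the device having copy-engine support (`asyncEngineCount` > 0); on such hardware the even and odd chunks pipeline through the two streams.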

You don't need CUDA 4.0 to do that. Asynchronous host/device memory transfers are an old story. What CUDA 4.0 adds is unified memory addressing across several GPUs. The GPUs can now communicate without bothering the host. – CygnusX1 Oct 17 '11 at 21:34

Cool. I still think vanilla page-locked host memory can be used to do what he wants, though... Right? – Patrick87 Oct 17 '11 at 21:51

I'm trying to apply a ping-pong technique on the GPU; the technique is described below: – user997704 Oct 18 '11 at 5:57

When we want to perform computation on large data, ideally we send as much data as possible to the GPU, perform the computation, and send it back to the CPU, i.e. SEND, COMPUTE, SEND (back to CPU). While it sends data back to the CPU, the GPU has to stall. Now my plan is: given a CUDA program that, say, runs in the entire global memory, I'll compel it to run in half of the global memory so that I can use the other half for data prefetching. While computation is being performed in one half, I can simultaneously prefetch data into the other half, so there will be no stalls. Now tell me: is this feasible to do? Will performance be degraded or improved? It should improve... – user997704 Oct 18 '11 at 6:12

Answer based on your comment: when we want to perform computation on large data, ideally we send as much data as possible to the GPU, perform the computation, and send it back to the CPU, i.e. SEND, COMPUTE, SEND (back to CPU). While it sends data back to the CPU, the GPU has to stall. The plan: given a CUDA program that runs in the entire global memory, compel it to run in half of the global memory so the other half can be used for data prefetching; while computation runs in one half, data is simultaneously prefetched into the other half, so there are no stalls. Is this feasible? Will performance be degraded or improved?

CUDA streams were introduced to enable exactly this approach. If your computation is fairly intensive, then yes - it can greatly speed up your performance.

On the other hand, if data transfers take, say, 90% of your time, you will save only on computation time - that is, 10% tops... The details, including examples, of how to use streams are provided in the CUDA Programming Guide. For version 4.0, that will be section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior" - there, they launch another, asynchronous memory copy while a kernel is still running.
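The half-memory ping-pong scheme from the comment can be sketched with two device buffers and two streams; the buffer layout and the `compute` kernel here are assumptions for illustration, not the asker's actual program:

```cuda
#include <cuda_runtime.h>

__global__ void compute(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i];   // stand-in for the real work
}

int main() {
    const int CHUNK = 1 << 20, NCHUNKS = 8;
    float *h, *d[2];

    cudaMallocHost(&h, (size_t)NCHUNKS * CHUNK * sizeof(float)); // pinned
    cudaMalloc(&d[0], CHUNK * sizeof(float));   // "ping" half of device memory
    cudaMalloc(&d[1], CHUNK * sizeof(float));   // "pong" half

    cudaStream_t copy_s, exec_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&exec_s);

    // Preload chunk 0 into the ping buffer.
    cudaMemcpyAsync(d[0], h, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, copy_s);
    cudaStreamSynchronize(copy_s);

    for (int c = 0; c < NCHUNKS; ++c) {
        int cur = c % 2, nxt = (c + 1) % 2;
        // Compute on the current half...
        compute<<<(CHUNK + 255) / 256, 256, 0, exec_s>>>(d[cur], CHUNK);
        // ...while prefetching the next chunk into the other half.
        if (c + 1 < NCHUNKS)
            cudaMemcpyAsync(d[nxt], h + (size_t)(c + 1) * CHUNK,
                            CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, copy_s);
        cudaStreamSynchronize(copy_s);
        cudaStreamSynchronize(exec_s);
        // Results could be copied back asynchronously here as well.
    }

    cudaStreamDestroy(copy_s); cudaStreamDestroy(exec_s);
    cudaFree(d[0]); cudaFree(d[1]); cudaFreeHost(h);
    return 0;
}
```

Note that this is precisely what streams provide: the "prefetch into the other half" is just an asynchronous copy issued in a different stream than the running kernel.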

I think you didn't get my point... CUDA streams are not based on the ping-pong technique that was used in DSPs a few years back. I'm trying to implement this in CUDA so that, as I explained, there won't be any stalls. Streams may also prevent stalls, but what I'm proposing is something new, and I just want some input on how to implement it, because here I'll have to scan the program for that... – user997704 Oct 18 '11 at 9:00

