In the case of a synchronous API call with regular pageable user-allocated memory, the answer is that it runs on both. The driver must first copy the data from the source memory into a DMA-mapped (pinned) staging buffer on the host, then signal the GPU that data is ready for transfer. The GPU's DMA engine then executes the transfer.
This process is repeated as many times as necessary until the complete copy from source memory to the GPU has finished.
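Here is a minimal sketch of what that looks like from the caller's side, assuming the CUDA runtime API (buffer sizes and names are illustrative, not from the original answer). The same cudaMemcpy call behaves differently depending on whether the host buffer is pageable or pinned:

```cuda
// Minimal sketch, assuming a CUDA-capable device and the CUDA runtime API.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t n = 1 << 24;                               // hypothetical size: 16M floats
    float *pageable = (float*)malloc(n * sizeof(float));    // regular pageable host memory
    float *pinned = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));             // page-locked (pinned) host memory
    float *device = nullptr;
    cudaMalloc(&device, n * sizeof(float));

    // Pageable source: the driver first copies chunks into its own pinned
    // staging buffer on the CPU, then the GPU's DMA engine transfers each
    // chunk, repeating until the whole region has been copied.
    cudaMemcpy(device, pageable, n * sizeof(float), cudaMemcpyHostToDevice);

    // Pinned source: the GPU's DMA engine can read the host buffer directly,
    // so the intermediate CPU-side copy is skipped.
    cudaMemcpy(device, pinned, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(device);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```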
I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus, so it doesn't really make sense (even if it were physically possible) to run multiple transfers in parallel.