It seems rather long for a cudaMalloc. Also check that your driver is up to date.
That is understandable. nvcc embeds PTX code in the application binary, which has to be compiled to a native GPU binary by a JIT compiler. This accounts for the start-up delay.
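A rough way to check whether the delay is a one-time start-up cost rather than the allocation itself is to time two cudaMalloc calls back to back. The sketch below is only an illustration: the helper name time_malloc_ms and the 1 MiB size are made up for this example, and the numbers will depend on your GPU, driver, and toolkit version. The first CUDA runtime call also pays for context creation, so it may be slow even when no JIT compilation is involved.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time one cudaMalloc call in milliseconds.
// The matching cudaFree is not included in the measurement.
static double time_malloc_ms(size_t bytes) {
    void* ptr = nullptr;
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for the test

    // The first call triggers the one-time runtime initialization.
    double first = time_malloc_ms(bytes);
    // The second call should reflect the cost of the allocation alone.
    double second = time_malloc_ms(bytes);

    std::printf("first cudaMalloc:  %.3f ms\n", first);
    std::printf("second cudaMalloc: %.3f ms\n", second);
    return 0;
}
```

If the second call is much faster than the first, the time is going into start-up work and not into cudaMalloc. Building with nvcc and running it twice in a row can also show whether a cached JIT result shortens the first call.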
AFAIK malloc is not slower than memcpy.
I can't really give you an answer, but what I can give you is a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.