The intrinsics correspond relatively directly to actual instructions, but compilers are not obligated to issue the corresponding instructions. Optimizing a load followed by an operation (even when written in intrinsics) into the memory form of the operation is a common optimization performed by all respectable compilers when it is advantageous to do so. TLDR: write the load and the operation in intrinsics, and let the compiler optimize it.
Edit: trivial example:

    #include <emmintrin.h>

    __m128i foo(__m128i *addr)
    {
        __m128i a = _mm_load_si128(addr);
        __m128i b = _mm_load_si128(addr + 1);
        return _mm_unpacklo_epi8(a, b);
    }

Compiling with gcc -Os -fomit-frame-pointer gives:

    _foo:
        movdqa      (%rdi), %xmm0
        punpcklbw   16(%rdi), %xmm0
        retq

See? The second load has been folded into the memory operand of punpcklbw. The optimizer will sort it out.
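To check that the result is the same whether or not the compiler folds the second load into the unpack, here is a minimal sketch that runs the same load/load/unpack sequence on known data and stores the result as bytes. The names `unpack_lo_pair` and `unpack_lo_demo` are hypothetical, chosen only for illustration, and an x86 target with SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: same shape as foo() in the answer. Loads two
   adjacent 16-byte-aligned vectors and interleaves their low 8 bytes. */
static __m128i unpack_lo_pair(const __m128i *addr)
{
    __m128i a = _mm_load_si128(addr);
    __m128i b = _mm_load_si128(addr + 1);
    return _mm_unpacklo_epi8(a, b);
}

/* Hypothetical driver: fills 32 bytes with 0..31, runs the unpack,
   and writes the 16 result bytes to out. */
static void unpack_lo_demo(uint8_t out[16])
{
    __m128i buf[2];                    /* __m128i storage is 16-byte aligned */
    uint8_t *bytes = (uint8_t *)buf;

    for (int i = 0; i < 32; ++i)
        bytes[i] = (uint8_t)i;

    /* Unaligned store, so the caller's buffer needs no special alignment. */
    _mm_storeu_si128((__m128i *)out, unpack_lo_pair(buf));
}
```

Calling `unpack_lo_demo(out)` should leave the bytes of the first vector (0..7) interleaved with those of the second (16..23) in `out`, regardless of which instruction forms the compiler picked. Compiling the same file with `-S` at different optimization levels shows whether the load was folded.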
I wouldn't complain if the compiler optimized it, but at least clang and gcc don't. This is easy to check with the -S option. I find verbatim intrinsics -> assembly translations for almost any intrinsic and can map registers directly to variables. Looks like these compilers hardly optimize SIMD intrinsics code... – dietr Jul 28 '10 at 20:15

@dietr: clang and gcc both do this optimization, as you can see in my example. Are you building with optimization turned off? Try using -O1 or higher. – Stephen Canon Jul 28 '10 at 20:23

I'm using -O2. I guess gcc/clang simply don't see any optimization potential in my particular piece of code... – dietr Jul 28 '10 at 23:33

@dietr: can you post the relevant snippet of your code? – Stephen Canon Jul 28 '10 at 0:39
You can just use your memory values directly. For example:

    __m128i *p=static_cast<__m128i *>(_aligned_malloc(8*4,16));
    for(int i=0;i<32;++i)
        reinterpret_cast<unsigned char *>(p)[i]=static_cast<unsigned char>(i);
    __m128i xyz=_mm_unpackhi_epi8(p[0],p[1]);

The interesting part of the result:

    ; __m128i xyz=_mm_unpackhi_epi8(p[0],p[1]);
    0040BC1B 66 0F 6F 00        movdqa      xmm0,xmmword ptr [eax]
    0040BC1F 66 0F 6F 48 10     movdqa      xmm1,xmmword ptr [eax+10h]
    0040BC24 66 0F 68 C1        punpckhbw   xmm0,xmm1
    0040BC28 66 0F 7F 04 24     movdqa      xmmword ptr [esp],xmm0

So the compiler is doing a bit of a poor job here (or perhaps this way is faster, or playing with the options would fix that), but it generates code that works, and the C++ code is stating what it wants fairly directly.
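For readers not on MSVC, the same idea can be sketched in C with `_mm_malloc` and `_mm_free` swapped in for `_aligned_malloc`; gcc and clang make them available through the intrinsics headers. This is an illustration rather than the answerer's code: the function name `unpack_hi_demo` and its return convention are assumptions, and an x86 target with SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_malloc/_mm_free */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: allocates two aligned vectors, fills them with
   the bytes 0..31, and interleaves the high 8 bytes of each into out.
   Returns 0 on success, -1 if the allocation fails. */
static int unpack_hi_demo(uint8_t out[16])
{
    __m128i *p = (__m128i *)_mm_malloc(2 * sizeof(__m128i), 16);
    if (p == NULL)
        return -1;

    uint8_t *bytes = (uint8_t *)p;
    for (int i = 0; i < 32; ++i)
        bytes[i] = (uint8_t)i;

    /* Use the memory values directly; the compiler decides whether to
       emit separate loads or a memory operand for the unpack. */
    __m128i xyz = _mm_unpackhi_epi8(p[0], p[1]);

    _mm_storeu_si128((__m128i *)out, xyz);
    _mm_free(p);
    return 0;
}
```

On success `out` holds bytes 8..15 of the first vector interleaved with bytes 24..31 of the second. Memory obtained from `_mm_malloc` has to be released with `_mm_free`, not `free`.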