I am trying to normalize a 4D vector.
My first approach was to use SSE intrinsics, which gave roughly a 2x speed boost to my vector arithmetic. Here is the basic code (v.v4 is the input; I'm using GCC; all of this is inlined):
//square each component
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
//keep a running sum in t
v4sf t = s;
//horizontal add: after three shuffle+add steps every lane holds the sum of the four squares
s = __builtin_ia32_shufps(s, s, 0x1B); // reverse lane order
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x4e); // swap 64-bit halves
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x1B); // reverse again
t = __builtin_ia32_addps(t, s);
//approximate 1/sqrt of the sum
t = __builtin_ia32_rsqrtps(t);
//scale the original vector by 1/length
return Vec4(__builtin_ia32_mulps(v.v4, t));
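For reference, the same sequence can be written with the portable intrinsics from <xmmintrin.h> instead of the GCC builtins. This is just a sketch, assuming v.v4 is a __m128 holding { x, y, z, w }:
#include <xmmintrin.h>
// Portable-intrinsics version of the same horizontal-sum + rsqrt sequence.
static inline __m128 normalize4(__m128 v)
{
    __m128 s = _mm_mul_ps(v, v);        // squares
    __m128 t = s;
    s = _mm_shuffle_ps(s, s, 0x1B);     // reverse lane order
    t = _mm_add_ps(t, s);
    s = _mm_shuffle_ps(s, s, 0x4E);     // swap 64-bit halves
    t = _mm_add_ps(t, s);
    s = _mm_shuffle_ps(s, s, 0x1B);     // reverse again
    t = _mm_add_ps(t, s);               // every lane now holds x*x+y*y+z*z+w*w
    t = _mm_rsqrt_ps(t);                // approximate 1/sqrt
    return _mm_mul_ps(v, t);            // scale by 1/length
}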
I checked the disassembly, and it looks like what I would expect; I don't see any big problems there.
Anyway, I then tried the fast inverse square root approximation (which I found via Google):
float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // newton step, repeating this step
// increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);
It runs slightly faster than the SSE version (about 5-10% faster), and its results are very accurate: to within about 0.001 when computing the length! But GCC warns that the type punning violates the strict aliasing rules.
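For comparison, SSE's scalar rsqrtss instruction computes an estimate of similar (roughly 12-bit) accuracy in a single instruction. A minimal sketch, not part of my original code:
#include <xmmintrin.h>
// Hardware counterpart of the bit-trick estimate: rsqrtss returns
// an approximation of 1/sqrt(x) accurate to roughly 12 bits.
static inline float rsqrt_hw(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}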
So I modified the approximation to use a union instead:
union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);
And now the modified version (with no warnings) runs slower!! It runs at almost 60% of the speed of the SSE version (but gives the same result)! Why is this?
So here are my questions:
- Is my SSE implementation correct?
- Is SSE really slower than normal FPU operations?
- Why the hell is the third snippet so much slower?
I am a dope: I realized I had SETI@Home running while benchmarking. I'm guessing it was killing my SSE performance. Turning it off got the SSE version running twice as fast.
I also tested it on an AMD Athlon and got the same results: SSE was faster.
At least I fixed the shuf bug!
Here is the most efficient assembly code I can think of; you can compare it to what your compiler generates. Assume the input and output are in xmm0.
; start with xmm0 = { v.x, v.y, v.z, v.w }
movaps  %xmm0, %xmm4      ; save the original vector until the end
mulps   %xmm0, %xmm0      ; xmm0 = { x*x, y*y, z*z, w*w }
pshufd  $1, %xmm0, %xmm1  ; xmm1 = { y*y, ... }
addss   %xmm0, %xmm1      ; xmm1 = { x*x+y*y, ... }
pshufd  $3, %xmm0, %xmm2  ; xmm2 = { w*w, ... }
movhlps %xmm0, %xmm3      ; xmm3 = { z*z, w*w, ?, ? }
addss   %xmm1, %xmm3      ; xmm3 = { x*x+y*y+z*z, ... }
addss   %xmm3, %xmm2      ; xmm2 = { x*x+y*y+z*z+w*w, ... }
rsqrtps %xmm2, %xmm1      ; xmm1 = { rsqrt(sum), ... }
pshufd  $0, %xmm1, %xmm1  ; broadcast: xmm1 = { rsqrt(sum) } x4
mulps   %xmm4, %xmm1      ; multiply the saved original by 1/length
movaps  %xmm1, %xmm0
; end with xmm0 = { v.x/sqrt(sum), v.y/sqrt(sum), v.z/sqrt(sum), v.w/sqrt(sum) }
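Note that rsqrtps only guarantees roughly 12 bits of relative accuracy. If you need more, one Newton-Raphson step (the same step the scalar approximation uses) refines the estimate. A sketch in intrinsics, assuming t holds the sum of squares in every lane and r is the rsqrtps result:
#include <xmmintrin.h>
// One Newton-Raphson iteration for the reciprocal square root:
// r' = 0.5 * r * (3 - t * r * r), roughly doubling the bits of accuracy.
static inline __m128 rsqrt_nr(__m128 t, __m128 r)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 rr = _mm_mul_ps(r, r);
    return _mm_mul_ps(_mm_mul_ps(half, r),
                      _mm_sub_ps(three, _mm_mul_ps(t, rr)));
}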
My guess is that the third version is slower because the compiler decides to keep the union in a memory variable, whereas in the pointer-cast case it can copy the values from register to register. You can check by looking at the generated machine code.
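If the union is indeed going through memory, memcpy is the usual warning-free way to type-pun; for a fixed 4-byte copy, GCC typically compiles it down to a plain register move, avoiding both the warning and the memory round trip. A sketch (rsqrt_approx is a hypothetical helper name):
#include <string.h>
// Fast inverse square root with memcpy-based type punning.
// GCC normally turns these fixed-size memcpy calls into register moves.
static inline float rsqrt_approx(float x)
{
    float xhalf = 0.5f * x;
    int i;
    memcpy(&i, &x, sizeof i);      // read the float's bit pattern
    i = 0x5f3759df - (i >> 1);     // initial guess
    memcpy(&x, &i, sizeof x);      // write the bits back as a float
    x *= 1.5f - xhalf * x * x;     // one Newton step
    return x;
}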
As to why SSE is less accurate, I don't have an answer. It would help if you could give real numbers. If the difference is 0.3 on a vector of length 1, that would be outrageous.