开发者

Sum all elements in a quadword vector in ARM assembly with NEON

开发者 https://www.devze.com 2023-03-25 20:41 出处:网络
Im rather new to assembly and although the arm information center is often helpful sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values i

Im rather new to assembly and although the arm information center is often helpful sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a single precision r开发者_Python百科egister. I think the instruction VPADD can do what I need but I'm not quite sure.


You might try this (it's not in ASM, but you should be able to convert it easily):

float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);

In ASM it would be probably only VADD and VPADD.

I'm not sure if this is only one method to do this (and most optimal), but I haven't figured/found better one...

PS. I'm new to NEON too


It seems that you want to get the sum of a certain length of array, and not only four float values.

In that case, your code will work, but is far from optimized :

  1. many many pipeline interlocks

  2. unnecessary 32bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16 :

  vldmia {q0-q1}, [pSrc]!
  sub count, count, #8
loop:
  pld [pSrc, #32]
  vldmia {q3-q4}, [pSrc]!
  subs count, count, #8
  vadd.f32 q0, q0, q3
  vadd.f32 q1, q1, q4
  bgt loop

  vadd.f32 q0, q0, q1
  vpadd.f32 d0, d0, d1
  vadd.f32 s0, s0, s1
  • pld - while being an ARM instruction and not NEON - is crucial for performance. It drastically increases cache hit rate.

I hope the rest of the code above is self explanatory.

You will notice that this version is many times faster than your initial one.


Here is the code in ASM:

    vpadd.f32 d1,d6,d7    @ q3 is register that needs all of its contents summed          
    vadd.f32 s1,s2,s3     @ now we add the contents of d1 together (the sum)                
    vadd.f32 s0,s0,s1     @ sum += s1;

I may have forgotten to mention that in C the code would look like this:

float sum = 1.0f;
sum += number1 * number2;

I have omitted the multiplication from this little piece asm of code.

0

精彩评论

暂无评论...
验证码 换一张
取 消