开发者

Fast multiplication of values in an array

开发者 https://www.devze.com 2023-01-15 05:08 出处:网络
Is there a fast w开发者_运维技巧ay to multiply values of a float array in C++, to optimize this function (where count is a multiple of 4):

Is there a fast w开发者_运维技巧ay to multiply values of a float array in C++, to optimize this function (where count is a multiple of 4):

void multiply(float* values, float factor, int count)
{
    for(int i=0; i < count; i++)
    {
        *value *= factor;
        value++;
    }
}

A solution must work on Mac OS X and Windows, Intel and non-Intel. Think SSE, vectorization, compiler (gcc vs. MSVC).


If you want your code to be cross-platform, then either you're going to have to write platform-independent code, or you're going to have to write a load of #ifdefs.

Have you tried some manual loop unrolling, and seeing if it makes any difference?


Since you know the count is a multiple of 4, you can unroll your loop...

void multiply(float* values, float factor, int count)
{
    count = count >> 2; // count / 4
    for(int i=0; i < count ; i++)
    {
        *value *= factor;
        *(value+1) *= factor;
        *(value+2) *= factor;
        *(value+3) *= factor;
        value += 4;
    }
}


Disclaimer: obviously, this won't work on iPhone, iPad, Android, or their future equivalents.

#include <mmintrin.h>
#include <xmmintrin.h>

__m128 factor4 = _mm_set1_ps(factor);
for (int i=0; i+3 < count; i += 4)
{
   __m128 data = _mm_mul_ps(_mm_loadu_ps(values), factor4);
   _mm_storeu_ps(values, data);
   values += 4;
}
for (int i=(count/4)*4; i < count; i++)
{
   *values *= factor;
   value++;
}


Have you thought of OpenMP?

Most modern computers have multi-core CPUs and nearly every major compiler seems to have OpenMP built-in. You gain speed at barely any cost.

See Wikipedia's article on OpenMP.


The best solution is to keep it simple, and let the compiler optimize it for you. GCC knows about SSE, SSE2, altivec and what else. If your code is too complex, your compiler won't be able to optimize it on every possible target.


As you mentioned, there are numerous architectures out there that have SIMD extensions and SIMD is probably your best bet when it comes to optimization. They are all however platform specific and the C and C++ as languages are not SIMD friendly.

The first thing you should try however is enabling the SIMD specific flags for your given build. The compiler may recognize patterns that can be optimized with SIMD.

The next thing is to write platform specific SIMD code using compiler intrinsics or assembly where appropriate. You should however keep a portable non-SIMD implementation for platforms that do not have an optimized version. #ifdefs enable SIMD on platforms that support it.

Lastly, at least on ARM but not sure on Intel, be aware that smaller integer and floating point types allow a larger number of parallel operations per single SIMD instruction.


I think, there is not a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE. But Modern CPUs are quite fast already. In some applications memory bandwidth / latency is actually the bottleneck and it gets worse. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So, it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add and use it like this:

void multiply(float vec[], float factor, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] += summand;
}

void foo(float vec[], int size)
{
  multiply(vec,2.f,size);
  add(vec,9.f,size);
}

you're basically passing twice over the block of memory. Depending on the vector's size it might not fit into the L1 cache in which case passing over it twice adds some extra time. This is obviously bad and you should try to keep memory accesses "local". In this case, a single loop

void foo(float vec[], int size)
{
  for (int i=0; i<size; ++i) {
    vec[i] = vec[i]*2+9;
  }
}

is likely to be faster. As a rule of thumb: Try to access memory linearly and try to access memory "locally" by which I mean, try to reuse the data that is already in the L1 cache. Just an idea.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号