I believe it is common to have code like this in C++:
for(size_t i=0;i<ARRAY_SIZE;++i)
A[i]=B[i]*C[i];
One commonly advocated alternative is:
double* pA=A,pB=B,pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
*pA++=(*pB++)*(*pC++);
What I am wondering is the best way of improving this code. IMO, the following things need to be considered:
- CPU cache. How do CPUs fill their caches to achieve the best hit rate?
- I suppose SSE could improve this?
- The other thing is: what if the code could be parallelized, e.g. using OpenMP? In that case, the pointer trick may not be available.
Any suggestions would be appreciated!
My g++ 4.5.2 produces absolutely identical code for both loops (having fixed the error in double *pA=A, *pB=B, *pC=C;), and it is:
.L3:
movapd B(%rax), %xmm0
mulpd C(%rax), %xmm0
movapd %xmm0, A(%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
(where my ARRAY_SIZE was 10000)
The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.
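For reference, the OpenMP version really is a one-line change (a sketch; assumes compiling with -fopenmp, and the array sizes/names are just illustrative):

```cpp
#include <cstddef>

const std::size_t ARRAY_SIZE = 10000;
double A[ARRAY_SIZE], B[ARRAY_SIZE], C[ARRAY_SIZE];

void multiply()
{
    // The pragma splits the iteration range across threads; each
    // thread works on a disjoint chunk of i, so no locking is needed.
    // Without -fopenmp the pragma is simply ignored and the loop runs serially.
    #pragma omp parallel for
    for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
        A[i] = B[i] * C[i];
}
```

Note that with OpenMP there is no pointer trick to lose: the indexed form is exactly what the pragma needs.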
The rules for performance are:
- not yet (i.e., don't optimize before you need to)
- get a target
- measure
- get an idea of how much improvement is possible, and verify it is worthwhile to spend the time to get it.
This is even more true for modern processors. About your questions:
- Simple index-to-pointer mapping is often done by the compiler, and when it isn't, the compiler may have good reasons.
- Processors are often already optimized for sequential access to the cache: straightforward code generation will often give the best performance.
- SSE can perhaps improve this, but not if you are already bandwidth-limited. So we are back to the measure-and-determine-bounds stage.
- Parallelization: same as SSE. Using the multiple cores of a single processor won't help if you are bandwidth-limited. Using different processors may help, depending on the memory architecture.
- Manual loop unrolling (suggested in a now-deleted answer) is often a bad idea. Compilers know how to do it when it is worthwhile (for instance, if they can do software pipelining), and with modern out-of-order processors it often is not: it increases the pressure on the instruction and trace caches, while out-of-order execution, speculation over jumps, and register renaming automatically bring most of the benefits of unrolling and software pipelining.
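The "measure" step above needs no fancy tooling to start with. A minimal sketch (names and the use of std::clock are illustrative; CPU-time resolution is coarse, hence the repetitions):

```cpp
#include <cstddef>
#include <ctime>

const std::size_t N = 10000;
double A[N], B[N], C[N];

// Time many passes of the multiply loop and return milliseconds per pass.
// std::clock measures CPU time, which is enough for a first
// order-of-magnitude estimate before deciding whether tuning is worthwhile.
double ms_per_pass(int reps)
{
    std::clock_t start = std::clock();
    for (int r = 0; r < reps; ++r)
        for (std::size_t i = 0; i < N; ++i)
            A[i] = B[i] * C[i];
    return 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC / reps;
}
```

Comparing this number against the machine's memory bandwidth tells you quickly whether the loop is already bandwidth-bound.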
The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.
For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.
Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).
OpenMP may help if the loop is huge and the operations within it are complicated enough that you are not already memory-bound.
In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.
When should you start considering SSE or OpenMP? If both of these are true:

- Code similar to yours appears 20 times or more in your project:

for (size_t i = 0; i < ARRAY_SIZE; ++i)
    A[i] = B[i] * C[i];

(or some similar operation), and
- ARRAY_SIZE is routinely bigger than 10 million, or a profiler tells you that this operation is becoming a bottleneck.

Then:
- First, make it into a function:
void array_mul(double* pa, const double* pb, const double* pc, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        pa[i] = pb[i] * pc[i];
}
- Second, if you can afford to find a suitable SIMD library, change your function to use it.
- Good portable SIMD library
- SIMD C++ library
As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.
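For illustration, a hand-written SSE2 version of such a function might look like the sketch below (not any particular library's API; a real implementation would also consider alignment and prefetching):

```cpp
#include <cstddef>
#include <emmintrin.h>   // SSE2 intrinsics

void array_mul(double* pa, const double* pb, const double* pc, std::size_t count)
{
    std::size_t i = 0;
    // Process two doubles per iteration in a 128-bit SSE2 register.
    // _mm_loadu_pd / _mm_storeu_pd tolerate unaligned pointers.
    for (; i + 2 <= count; i += 2) {
        __m128d b = _mm_loadu_pd(pb + i);
        __m128d c = _mm_loadu_pd(pc + i);
        _mm_storeu_pd(pa + i, _mm_mul_pd(b, c));
    }
    for (; i < count; ++i)   // scalar tail for odd count
        pa[i] = pb[i] * pc[i];
}
```

A SIMD library hides exactly this kind of boilerplate (and the per-architecture variants of it) behind a portable interface.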
You can use some easy parallelization method. CUDA is hardware-dependent, but SSE is almost standard in every CPU. You can also use multiple threads; with multiple threads you can still use the pointer trick, though it is not very important, and such simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010, you can use parallel_invoke to execute functions in parallel without dealing with Windows threads. On Linux, the pthread library is quite easy to use.
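As an illustration of the multiple-threads approach, here is a sketch that splits the range in half, written with C++11 std::thread rather than the Windows- or pthread-specific APIs mentioned above (function names are illustrative):

```cpp
#include <cstddef>
#include <thread>

// Multiply one sub-range; each thread gets a disjoint [begin, end),
// so the threads never touch the same elements and need no locking.
static void mul_range(double* a, const double* b, const double* c,
                      std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        a[i] = b[i] * c[i];
}

void array_mul_2threads(double* a, const double* b, const double* c, std::size_t n)
{
    std::thread t(mul_range, a, b, c, std::size_t(0), n / 2); // first half
    mul_range(a, b, c, n / 2, n);                             // second half, this thread
    t.join();
}
```

Whether two threads actually beat one here depends on memory bandwidth, as the other answers note.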
I think valarrays, which are specialised for such calculations, could be used here. I am not sure whether they will improve performance, though.
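For what it's worth, std::valarray lets the whole loop collapse into one expression (a sketch; the initializer-list syntax in the usage below requires C++11):

```cpp
#include <valarray>

// std::valarray overloads the arithmetic operators element-wise,
// so b * c computes the same result as the explicit A[i] = B[i] * C[i] loop.
std::valarray<double> multiply(const std::valarray<double>& b,
                               const std::valarray<double>& c)
{
    return b * c;
}
```

Whether an implementation actually vectorizes valarray expressions is implementation-defined, which is why its performance benefit is uncertain.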