Suppose you want to compute the sum of the squared differences of consecutive items:
$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2$
The simplest code (the input is std::vector<double> xs, the output sum2) is:
double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += (prev - (*i)) * (prev - (*i)); // only 1 subtraction with compiler optimization
    prev = (*i);
}
I hope the compiler does the optimization mentioned in the comment above. If N is the length of xs, you have N-1 multiplications and 2N-3 sums (where "sums" means + or -).
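If you would rather not rely on the compiler for that optimization, you can make the single subtraction explicit; a minimal sketch of the same loop:
double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1; i != xs.end(); ++i)
{
    double d = prev - (*i); // compute the difference once
    sum2 += d * d;          // one multiplication and one addition per element
    prev = (*i);
}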
Now suppose you know this variable:
$x_1^2 + x_N^2 + 2 \sum_{i=2}^{N-1} x_i^2$
and call it sum. Expanding the binomial square:
$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2 = \sum_{i=1}^{N-1} x_i^2 - 2\sum_{i=1}^{N-1} x_i x_{i+1} + \sum_{i=1}^{N-1} x_{i+1}^2 = \mathrm{sum} - 2\sum_{i=1}^{N-1} x_i x_{i+1}$
so the code becomes:
double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += (*i) * prev;
    prev = (*i);
}
sum2 = -sum2 * 2. + sum;
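For reference, sum itself only needs to be computed once from the same data; a minimal sketch (the helper name precompute_sum is mine):
// Computes x_1^2 + x_N^2 + 2*sum_{i=2}^{N-1} x_i^2, the quantity called sum above.
double precompute_sum(const std::vector<double>& xs)
{
    double s = xs.front() * xs.front() + xs.back() * xs.back();
    for (std::size_t i = 1; i + 1 < xs.size(); ++i)
        s += 2. * xs[i] * xs[i];
    return s;
}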
With the rewritten loop I have N multiplications and N-1 additions. In my case N is about 100. Well, compiling with g++ -O2 I got no speedup (I tried calling the inlined function 2M times). Why?
Multiplications are much more costly than additions in terms of execution time. Also, depending on the processor, additions and multiplications can be done in parallel, i.e. it will start the next multiplication while it is still doing the addition (see http://en.wikipedia.org/wiki/Out-of-order_execution). So reducing the number of additions will not help performance much.
What you can do is make it easier for the compiler to vectorize your code, or vectorize it yourself. To make it easier for the compiler to vectorize, I would use a regular array of doubles and use subscripts, not pointers.
EDIT: N = 100 might also be too small a number to see a difference in execution time. Try a bigger N.
Dirty code, but it shows the perf improvement. Output (the result followed by the cycle count, for f and then f2):
1e+06
59031558
1e+06
18710703
The speedup you get is ~3x.
#include <vector>
#include <iostream>

using namespace std;

// Read the CPU's timestamp counter (x86).
unsigned long long int rdtsc(void)
{
    unsigned a, d;
    __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}

// Iterator-based version from the question (end iterator hoisted out of the loop).
double f(std::vector<double>& xs)
{
    double sum2 = 0.;
    double prev = xs[0];
    vector<double>::const_iterator iend = xs.end();
    for (vector<double>::const_iterator i = xs.begin() + 1;
         i != iend; ++i)
    {
        sum2 += (prev - (*i)) * (prev - (*i)); // only 1 subtraction with compiler optimization
        prev = (*i);
    }
    return sum2;
}

// Subscript-based version over a plain array.
double f2(double *xs, int N)
{
    double sum2 = 0;
    for (int i = 0; i < N - 1; i += 1) {
        sum2 += (xs[i+1] - xs[i]) * (xs[i+1] - xs[i]);
    }
    return sum2;
}

int main(int argc, char* argv[])
{
    int N = 1000001;
    std::vector<double> xs;
    for (int i = 0; i < N; i++) {
        xs.push_back(i);
    }

    unsigned long long int a, b;

    a = rdtsc();
    std::cout << f(xs) << endl;
    b = rdtsc();
    cout << b - a << endl;

    a = rdtsc();
    std::cout << f2(&xs[0], N) << endl;
    b = rdtsc();
    cout << b - a << endl;
}
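If you want to try the "vectorize by yourself" route, here is a rough SSE2 sketch of the same kernel (an illustration only; it assumes an x86 CPU with SSE2, and the name f3 is mine):
#include <emmintrin.h> // SSE2 intrinsics

// Same computation as f2, processing two differences per iteration.
double f3(const double* xs, int N)
{
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= N - 1; i += 2) {
        __m128d lo = _mm_loadu_pd(xs + i);       // {xs[i],   xs[i+1]}
        __m128d hi = _mm_loadu_pd(xs + i + 1);   // {xs[i+1], xs[i+2]}
        __m128d d  = _mm_sub_pd(hi, lo);         // two differences at once
        acc = _mm_add_pd(acc, _mm_mul_pd(d, d)); // accumulate the squares
    }
    double tmp[2];
    _mm_storeu_pd(tmp, acc);
    double sum2 = tmp[0] + tmp[1];
    for (; i < N - 1; ++i) {                     // scalar tail
        double d = xs[i + 1] - xs[i];
        sum2 += d * d;
    }
    return sum2;
}
Timing it with the same rdtsc harness as above would show whether it actually beats the compiler's auto-vectorized f2.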
Addition can be free when done as x += a*b (a fused multiply-add). The compiler should be able to figure that out in the first version, if the architecture supports it.
The arithmetic is probably happening in parallel with the load of *i, which could be the slower operation.
Do not call xs.end() at every loop iteration unless you expect the return value to change. If the compiler cannot optimize it out, it will dwarf the rest of the loop.
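A minimal sketch combining the fused multiply-add idea and the hoisted xs.end() call, assuming std::fma from C++11 <cmath> is available and maps to a hardware fused multiply-add (the name f_fma is mine):
#include <cmath>
#include <vector>

double f_fma(const std::vector<double>& xs)
{
    double sum2 = 0.;
    double prev = xs[0];
    const std::vector<double>::const_iterator iend = xs.end(); // call end() once
    for (std::vector<double>::const_iterator i = xs.begin() + 1; i != iend; ++i)
    {
        double d = prev - *i;
        sum2 = std::fma(d, d, sum2); // d*d + sum2 in one fused operation where supported
        prev = *i;
    }
    return sum2;
}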