Optimize indexed array summation


I have the following C++ code:

const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];

// load id and value from an external source... 

int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}

How should I optimize the loop?

I considered using SSE to add every 4 floats into a running vector sum, so that after N iterations the result would just be the horizontal sum of the 4 floats in the xmm register, but this doesn't work when the source is indexed like this and needs to write out to 10 different sums.


This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that two values in the same SIMD register might map to the same id, e.g. when

id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0

in this case, the obvious approach (pseudocode here)

x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);

won't work either!

You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
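For illustration, here is a minimal sketch of that brute-force-compare idea using plain SSE2 intrinsics (the 4-wide loop, the acc/cnt accumulator names, and the assumption that N is a multiple of 4 are mine, not part of the original answer): every bucket keeps its own vector accumulator, and each iteration compares the 4 loaded ids against each bucket index to build a mask.

#include <emmintrin.h> // SSE2

void bucket_sums_sse2(const int* id, const float* value, int N,
                      int size[10], float sum[10])
{
    __m128  acc[10];
    __m128i cnt[10];
    for (int b = 0; b < 10; ++b) {
        acc[b] = _mm_setzero_ps();
        cnt[b] = _mm_setzero_si128();
    }

    for (int i = 0; i < N; i += 4) {
        __m128i ids = _mm_loadu_si128((const __m128i*)&id[i]);
        __m128  val = _mm_loadu_ps(&value[i]);
        for (int b = 0; b < 10; ++b) {
            __m128i eq   = _mm_cmpeq_epi32(ids, _mm_set1_epi32(b)); // -1 in lanes where id == b
            __m128  mask = _mm_castsi128_ps(eq);
            acc[b] = _mm_add_ps(acc[b], _mm_and_ps(mask, val));     // add value only where id == b
            cnt[b] = _mm_sub_epi32(cnt[b], eq);                     // subtracting -1 counts the hits
        }
    }

    // horizontal reduction of the per-bucket accumulators
    for (int b = 0; b < 10; ++b) {
        float f[4]; int c[4];
        _mm_storeu_ps(f, acc[b]);
        _mm_storeu_si128((__m128i*)c, cnt[b]);
        sum[b]  = f[0] + f[1] + f[2] + f[3];
        size[b] = c[0] + c[1] + c[2] + c[3];
    }
}

As the answer says, this does 10 compare-and-mask passes for every 4 elements, so it only pays off when the number of buckets is very small.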

One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:

int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
  int id0 = id[i*2+0], id1 = id[i*2+1];
  ++size[id0];
  ++size2[id1];
  sum[id0] += value[i*2+0];
  sum2[id1] += value[i*2+1];
}

// if N was odd, process last element
if (N & 1) {
  ++size[id[N-1]];
  sum[id[N-1]] += value[N-1];
}

// add partial sums together
for (int i = 0; i < 10; i++) {
  size[i] += size2[i];
  sum[i] += sum2[i];
}

Whether this helps or not depends on the target CPU though.


Well, you are looking up id[i] twice in your loop. You could store it in a local variable, or a register int if you wanted to.

register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}

The MSDN docs state this about register:

The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific:

The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.


Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:

int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}

which will cut the number of cmp instructions in half (note that this simple version skips the last element when N is odd, so that one still needs a scalar step). However, any half-decent optimizing compiler will do this kind of unrolling for you.


Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.

Do not optimise until you KNOW where the bottleneck is.

Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:

You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point) and it is STILL taking 10 seconds. Going from 11 seconds down to 10 really isn't that big a performance win, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.

In fact it can be quite optimal to do double-buffered data loads, i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When that finishes, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing, as sketched below.
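As an illustration of that double-buffering idea, here is a minimal sketch using std::async; the Chunk type and the caller-supplied load routine are assumptions made for the example, not part of the original answer.

#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// One chunk of input: parallel arrays of ids (0..9) and values.
struct Chunk {
    std::vector<int>   id;
    std::vector<float> value;
};

// 'load' fills a chunk from the external source and returns false when the
// source is exhausted (a hypothetical interface, just for this sketch).
void double_buffered_sum(const std::function<bool(Chunk&)>& load,
                         int size[10], float sum[10])
{
    Chunk buf[2];
    int cur = 0;
    bool have_data = load(buf[cur]);              // first load is synchronous

    while (have_data) {
        const int next = 1 - cur;
        // Start loading the next chunk in the background...
        auto pending = std::async(std::launch::async,
                                  [&load, &buf, next] { return load(buf[next]); });

        // ...while summing the chunk we already have (the loop from the question).
        const Chunk& c = buf[cur];
        for (std::size_t i = 0; i < c.id.size(); ++i) {
            ++size[c.id[i]];
            sum[c.id[i]] += c.value[i];
        }

        have_data = pending.get();                // wait for the background load
        cur = next;
    }
}

This overlaps the disk read with the summation, which is the point being made here: hide the cheap processing behind the expensive load.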

Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.


Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.

See here (Page 95 in the PDF) for more info from Intel.
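A minimal sketch of how that intrinsic might be dropped into the original loop (the 64-element look-ahead is an arbitrary assumption; the right prefetch distance has to be tuned on the target machine):

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

const int PREFETCH_AHEAD = 64;   // assumed look-ahead, tune per CPU

for (int i = 0; i < N; ++i)
{
    if (i + PREFETCH_AHEAD < N)
    {
        _mm_prefetch((const char*)&id[i + PREFETCH_AHEAD], _MM_HINT_T0);
        _mm_prefetch((const char*)&value[i + PREFETCH_AHEAD], _MM_HINT_T0);
    }
    ++size[id[i]];
    sum[id[i]] += value[i];
}

On modern CPUs the hardware prefetcher usually handles a purely sequential scan like this on its own, so measure before and after.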


This computation is trivially parallelizable; just add

#pragma omp parallel for reduction(+:size,sum) schedule(static)

immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
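Put together, the parallel version might look like the sketch below. Note that reducing over plain C arrays like this needs OpenMP 4.5 or newer; with older compilers you would have to keep per-thread copies and merge them yourself.

#include <omp.h>

// Assumes OpenMP 4.5+ for the array reductions (compile with -fopenmp in GCC).
int   size[10] = { 0 };
float sum[10]  = { 0 };

#pragma omp parallel for reduction(+:size,sum) schedule(static)
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}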

If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:

for (int i = 0, j = 0, k = 0; j < 10; j++)
{
    float tmp = 0;
    for (k += size[j]; i < k; i++)   // bucket j's values are contiguous after the pre-sort
        tmp += value[i];
    sum[j] += tmp;
}
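For completeness, a sketch of the pre-sort itself, done as a counting sort over the 10 ids (the sorted_value buffer and the offset[] array are my own names; the answer doesn't spell out how the reordering is done):

#include <vector>

// One-time preparation, assuming id[] stays fixed across runs:
// 1) count the elements per id (this also yields size[]),
// 2) compute the starting offset of each bucket,
// 3) scatter value[] into id order.
int size[10] = { 0 };
for (int i = 0; i < N; i++)
    ++size[id[i]];

int offset[10];
int pos = 0;
for (int j = 0; j < 10; j++) {
    offset[j] = pos;
    pos += size[j];
}

std::vector<float> sorted_value(N);
for (int i = 0; i < N; i++)
    sorted_value[offset[id[i]]++] = value[i];

The summation loop above then streams through the pre-sorted array (sorted_value here plays the role of value[]) and never touches id[] again, which is where the bandwidth saving comes from. Note that the scatter pass mutates offset[], so keep a copy if you still need the bucket starts afterwards.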
