optimizing array loop in c_问答_开发者_运维开发者技术经验分享

I have looked online and in my books but I can't seem to get this. I was asked to optimize a small part of a program. Specifically to take an array and add its contents within a small amount of time, with vi and gcc, without using the built-in optimizer. I have tried loop unrolling and a couple of other optimizations meant for products. Can you please help?

int length = ARRAY_SIZE;
int limit = length-4;
for (j=0; j < limit; j+=5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for(; j < length; j++){
    sum += array[j];    
}

The array values are non-constant ints and all values have b开发者_如何转开发een initialized.

Create sub-sums which then add up to a sum.

Here's a basic version of what it might look like

for (j=0; j < limit; j+=4) {
    sum1 += array[j];
    sum2 += array[j+1];
    sum3 += array[j+2];
    sum4 += array[j+3];
}
sum = sum1 + sum2 + sum3 + sum4;

This avoids some read-after-write dependencies - that is, the computation of sum2 in each loop iteration need not wait on the results of sum1 to execute, and the processor can schedule both lines in the loop simultaneously.

use sse/mmx set:

__m128i sum;
for (j=0; j < limit; j+=4) {
    sum = _mm_add_epi32(sum, array+j);
}

As it is, the loop is already unrolled by 5.

Since you're disabling the optimizer, all that indexing is going to cost you.

The first loop could be replaced by:

int* p = array;
for (j = 0; j < ARRAY_SIZE - 4; j += 5, p += 5){
  sum += p[0] + p[1] + p[2] + p[3] + p[4];
}

so it's not doing any indexing (multiplying j by sizeof(int) and adding it to the address).

Added: Of course, since ARRAY_SIZE is presumably a known constant, this is probably the fastest code, but you might need to write a code generator (or clever macro) to make it:

sum += array[0];
sum += array[1];
...
sum += array[ARRAY_SIZE - 1];

An example of such a macro is, if ARRAY_SIZE is a power of 2, like 64, you could have:

#define FOO64(i) FOO32(i); FOO32((i)+32)
#define FOO32(i) FOO16(i); FOO16((i)+16)
#define FOO16(i) FOO8(i); FOO8((i)+8)
#define FOO8(i) FOO4(i); FOO4((i)+4)
#define FOO4(i) FOO2(i); FOO2((i)+2)
#define FOO2(i) FOO1(i); FOO1((i)+1)
#define FOO1(i) sum += array[i]

FOO64(0);

You could do the same idea for other powers, like 10.

I'm not sure why you can't use the optimiser since, in my experience, it will usually produce faster code than the vast majority of "wanna-be" manual optimisers :-) In addition, you should make sure that this code is actually a problem area - there's no point optimising code that's already close to maximum speed, nor should you concern yourself with something that accounts for 0.01% of the time taken when there may be code elsewhere responsible for 20%.

Optimisation should be heavily targeted otherwise it's wasted effort.

Any solution other than the naive "just add the numbers together" will most likely have to use special features in the target CPU.

Provided you're willing to take a small hit on each update to the array (and this may not be an option given your "all values have been initialized" comment), you can get the sum in very quick time. Use a "class" to maintain the array and the sum side-by-side. Pseudo-code like:

def initArray (sz):
    allocate data as sz+1 integers
    foreach i 0 thru sz:
        set data[i] to 0

def killArray(data):
    free data

def getArray (data,indx):
    return data[indx+1]

def setArray (data,indx,val):
    data[0] = data[0] - data[indx] + val
    data[indx+1] = val

def sumArray(data):
    return data[0]

should do the trick.

The following complete C program shows a very rough first-cut which you can use as a basis for a more robust solution:

#include <stdio.h>
#include <stdlib.h>

static int *initArray (int sz) {
    int i;
    int *ret = malloc (sizeof (int) * (sz + 1));
    for (i = 0; i <= sz; i++)
        ret[i] = 0;
    return ret;
}

static void killArray(int *data) {
    free (data);
}

static int getArray (int *data, int indx) {
    return data[indx+1];
}

static void setArray (int *data, int indx, int val) {
    data[0] = data[0] - data[indx] + val;
    data[indx+1] = val;
}

static int sumArray (int *data) {
    return data[0];
}

int main (void) {
    int i;
    int *mydata = initArray (10);
    if (mydata != NULL) {
        setArray (mydata, 5, 27);
        setArray (mydata, 9, -7);
        setArray (mydata, 7, 42);
        for (i = 0; i < 10; i++)
            printf ("Element %d is %3d\n", i, getArray (mydata, i));
        printf ("Sum is %3d\n", sumArray (mydata));
    }
    killArray (mydata);
    return 0;
}

The output of this is:

Element 0 is   0
Element 1 is   0
Element 2 is   0
Element 3 is   0
Element 4 is   0
Element 5 is  27
Element 6 is   0
Element 7 is  42
Element 8 is   0
Element 9 is  -7
Sum is  62

As I said, this may not be an option but, if you can swing it, you'll be hard-pressed finding a faster way to get the sum than a single array index extraction.

And, as long as you're implementing a class to do this, you may as well use the first two elements for housekeeping, one for the current sum and one for the maximum index, so that you can avoid out-of-bounds errors by checking indx against the maximum.

You may gain more performance by prefetching data inside the rolled loop.
I'll build on Drew's answer:

register int value1, value2, value3, value4;
or (j=0; j < limit; j+=4)
{
    // Prefetch the data
    value1 = array[j];
    value2 = array[j + 1];
    value3 = array[j + 2];
    value4 = array[j + 4];

    // Use the prefetched data
    sum1 += value1;
    sum2 += value2;
    sum3 += value3;
    sum4 += value4;
}
sum = sum1 + sum2 + sum3 + sum4;

The idea here is to have the processor load contiguous data into it's cache, then operate on the cached data. In order for this to be effective, the compiler must not optimize-away the prefetching; this could be performed by declaring the temporary variables as volatile. I don't know if volatile can be combined with register.

Search the web for "Data driven design".

Since five seem to be the number of additions to do at a time in the sample, I do it here as well. Normally you do it with a power of 2 as Drew Hoskins suggested. By getting the modulo right at the beginning and stepping in the other direction fewer values might be needed. Computing in a different order is something that is often profitable in scientific computing, not just for indexing. To see if and how good an optimization is, testing is essential.

int sum1, sum2, sum3, sum4;

for(j = ARRAY_SIZE; j%5; j--){
    sum += array[j]; 
}
sum1 = sum2 = sum3 = sum4 = 0;
for (; j; j-=5) {
    sum += array[j-1];
    sum1 += array[j-2];
    sum2 += array[j-3];
    sum3 += array[j-4];
    sum4 += array[j-5];
}
sum += sum1+sum2+sum3+sum4;

One solution would be to maintain a sum at all times. You would, of course, have to do update it every time you change the values in the array, but if that doesn't happen that often might be worth the trouble.