I'm trying to write optimized code for accessing image pixels and need to make a for loop as fast as possible without dropping down to assembly. Furthermore, the indexing is done along the rows to minimize cache misses.
This is what I have:
for (indr = 0; indr < (height - 1) * width; indr += width) {
    for (indc = 0; indc < width; indc++) {
        I[indr + indc] = dostuff;
    }
}
I can't make it a single loop because the "dostuff" includes accessing elements that aren't on the same row.
Is there a faster way to do this?
EDIT: Okay, because my previous post was slightly unclear, I'm adding the full code here. It's pretty unreadable, but the general idea is that I'm performing a convolution with a simple box filter using an integral image. The image is first padded with ws+1 zeros on the left and bottom and ws zeros on the right and top. It is then made into an integral image Ii. The following function takes the integral image and extracts the convolution, where the result Ic is the same size as the original image.
void convI(float *Ic, float *Ii, int ws, int width, int height)
{
    int W = width + ws * 2 + 1, indR;
    int H = height + ws * 2 + 1, indC;
    int w = width, indr;
    int h = height, indc;
    int jmpA = W * (ws + 1), jmpC = W * ws, jmpB = ws + 1, jmpD = ws;
    for (indR = W * (ws + 1), indr = 0; indr < width * (height - 1); indR += W, indr += width) {
        for (indC = ws + 1, indc = 0; indc < width; indC++, indc++) {
            // Performs I[indA] + I[indD] - I[indB] - I[indC];
            Ic[indr + indc] =
                Ii[indR - jmpA + indC - jmpB] +
                Ii[indR + jmpC + indC + jmpD] -
                Ii[indR + jmpC + indC - jmpB] -
                Ii[indR - jmpA + indC + jmpD];
        }
    }
}
So that's the "dostuff" part. The loop is sluggish.
There is not much reason to expect other code to perform better than what you have, as long as all optimizations are on.
Why do you suspect the loop itself to be the bottleneck? Not much can be said without knowing what you are actually doing. Benchmark your code and look at the assembly it produces if you have doubts.
Edit: After you showed the inner part of your loop.
There is some potential in moving your index computations outside the loops as much as possible. Since they are intermixed with the loop variables, this probably can't be optimized as it should be. (Or just reorder the index computations so the compiler can see them and precompute as much as possible.)
Most likely the performance difficulties come from the accesses to your arrays. If you manage to compute your indices better, this might also improve, because the compiler/system will then see that you access your arrays in a regular pattern.
If this doesn't help, reorganize your loop so that the loads are incremental rather than the stores. Loads always have to wait until the data arrives before the operation can proceed; stores are less sensitive to that.
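As a sketch of what "computing the indices better" could look like for the posted convI (the name convI_hoisted and the rowA..rowD pointers are mine, not the asker's), the four corner offsets collapse into four row pointers that advance by W once per row, leaving only additions in the inner loop. Note this sketch loops over all height output rows, whereas the original's width*(height-1) bound covers one row fewer.

```c
#include <stddef.h>

/* Sketch (hypothetical names): the same four-corner lookup as convI,
 * with the corner offsets hoisted into four row pointers updated once
 * per row. Assumes the padded integral image Ii has row stride
 * W = width + 2*ws + 1, matching the question's padding scheme. */
void convI_hoisted(float *Ic, const float *Ii, int ws, int width, int height)
{
    const int W = width + 2 * ws + 1;           /* padded row stride */
    const int k = 2 * ws + 1;                   /* box side in pixels */
    const float *rowA = Ii;                     /* top-left corners */
    const float *rowB = Ii + k;                 /* top-right corners */
    const float *rowC = Ii + (size_t)k * W;     /* bottom-left corners */
    const float *rowD = Ii + (size_t)k * W + k; /* bottom-right corners */
    for (int r = 0; r < height; r++) {
        float *out = Ic + (size_t)r * width;
        for (int c = 0; c < width; c++)
            out[c] = rowA[c] + rowD[c] - rowB[c] - rowC[c];
        rowA += W; rowB += W; rowC += W; rowD += W;
    }
}
```

Whether this actually beats the original depends on how well the compiler was already folding the jmpA/jmpB/jmpC/jmpD arithmetic; the point is to make the regular access pattern explicit.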
You can unroll the innermost loop. You will lose readability, but the CPU's cache and prefetch queue will do a better job. Although this is generally true, I don't know how much speed you will gain.
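For illustration, an unroll-by-4 of the question's loop shape might look like the sketch below. Here dostuff is just a placeholder function standing in for the asker's actual computation, and fill_unrolled is a name I made up:

```c
/* Placeholder standing in for the question's "dostuff". */
static float dostuff(int indr, int indc)
{
    return (float)(indr + indc);
}

/* Sketch: the inner loop unrolled by 4, with a scalar cleanup loop
 * handling widths that are not a multiple of 4. */
void fill_unrolled(float *I, int width, int height)
{
    for (int indr = 0; indr < height * width; indr += width) {
        int indc = 0;
        for (; indc + 4 <= width; indc += 4) {
            I[indr + indc + 0] = dostuff(indr, indc + 0);
            I[indr + indc + 1] = dostuff(indr, indc + 1);
            I[indr + indc + 2] = dostuff(indr, indc + 2);
            I[indr + indc + 3] = dostuff(indr, indc + 3);
        }
        for (; indc < width; indc++)  /* leftover columns */
            I[indr + indc] = dostuff(indr, indc);
    }
}
```

Note that modern compilers often do this themselves at higher optimization levels, so measure before committing to the readability loss.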
You can declare both indc and indr as register variables and try to avoid recalculating (height-1)*width; keep it in a temporary variable instead. You know, multiplications eat a lot of clock cycles...
Unless you want to use vectorizing instructions like SSE, there's not much that can be done.
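Short of hand-written intrinsics, one option is to write the hot loop so the compiler's auto-vectorizer can emit SSE on its own. This is a sketch under my own assumptions (box_row and the pointer names are hypothetical, not from the question); the restrict qualifiers assert that the arrays don't alias, which is what commonly blocks vectorization of loops like this:

```c
/* Sketch: the four-corner integral-image sum as one flat loop over a
 * row. With restrict-qualified, non-overlapping pointers, GCC at -O3
 * (or -O2 -ftree-vectorize) can typically vectorize this with SSE. */
void box_row(float *restrict out,
             const float *restrict a,  /* top-left corners */
             const float *restrict b,  /* top-right corners */
             const float *restrict c,  /* bottom-left corners */
             const float *restrict d,  /* bottom-right corners */
             int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + d[i] - b[i] - c[i];
}
```

You can confirm what the compiler did with -fopt-info-vec (GCC) or by inspecting the generated assembly.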
What you have looks fine. If you want to avoid going into assembly, it's best to keep simple loops simple. GCC is smart: if you're clear about what you want your code to do, it generally does a good job optimizing it. However, if you do fancy tricks that aren't common in production code, it might have trouble deducing what you "really mean".
Depending on what dostuff actually does, you might find some win in caching I[indr+indc] in a temporary, so your code looks something like this:
char t = I[indr+indc];
// do stuff
I[indr+indc] = t;
This code will not perform worse (I assume you have at least the basic optimizations turned on), but it might perform better if your dostuff is fancy enough (I can elaborate if you want).
And don't listen to the other guys about lifting simple math out of loops. There's really no need. If you look at the assembly generated at -O1, you'll see this is done for you every time. It's one of the cheapest optimizations to make.
There MAY be a win in lifting the height-1 in the outer loop into an assignment before the loop. But then, I suspect a normal compiler these days would do that as a standard optimization. It may also be that having another pointer, set to I[indr], and indexing off that is a small win.
Both of these would require some pretty careful benchmarking to notice.
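The row-pointer idea above can be sketched like this (dostuff is again a stand-in for the asker's computation, and fill_rowptr is a made-up name):

```c
/* Placeholder for the question's "dostuff". */
static float dostuff(int indr, int indc)
{
    return (float)(indr + indc);
}

/* Sketch: set a pointer to the start of each row once, then index
 * off it, instead of recomputing I[indr + indc] on every iteration. */
void fill_rowptr(float *I, int width, int height)
{
    for (int indr = 0; indr < height * width; indr += width) {
        float *row = &I[indr];  /* one row-base computation per row */
        for (int indc = 0; indc < width; indc++)
            row[indc] = dostuff(indr, indc);
    }
}
```

A decent optimizer usually derives this form itself (strength reduction), which is why careful benchmarking is needed to see any difference.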
// DragonLord style:
float *ic_p = I + (width * height) - 1; // fencepost
// Start at the end and work backwards;
// assumes I is 0-based, wraps, and is contiguous.
for (indr = (height - 1) * width; indr >= 0; indr -= width) {
    // Sadly cannot test on indr -= width here,
    // as the 0 pass is needed for the loop.
    for (indc = width; indc--; ) {
        // Testing on the post-decrement lets you use the 0 value
        // one last time before testing it, FTW.
        // indr and indc are both 0-based inside the loop for you,
        // e.g. indc varies from (width - 1) down to 0,
        // due to the post-decrement before usage.
        printf("I[ %d + %d ] == %f \n", indr, indc, *ic_p);
        // Always use pointers in C/C++ for speed; we are not Java.
        *ic_p-- = dostuff;
    }
}
Performance may be slightly improved by counting down from height towards 0 if you don't need indr inside the loop, or by pre-decrementing instead of post-decrementing indc if you can get by with a 1-based indc, in which case indc should be initialized to (width + 1):
for (indc = (width + 1); --indc; ) {