This code (A) executes much faster (10 times) then the second one:
for(int w=0; w<width; w++) {
for(int h=1; h<height; h++) {
image[h][w] = (1-a)*image[h][w] + a*image[h-1][w];
}
}
Second one:
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
Why is that? it is the same to go through all of the pixels in the image in either horizontal or vertical 开发者_如何学运维direction.
Is there a way to speed up the second one?
Thanks in advance.
This has to do with locality of reference. If you access the elements in the same order as they are stored in memory, this will be much faster than accessing them in a strided pattern, as the memory caches and memory bandwidth will be utilized far more effectively.
The above would explain the second version being faster than the first one, and this is exactly what happens on my box:
aix@aix:~$ time ./ver1
real 0m29.421s
aix@aix:~$ time ./ver2
real 0m2.198s
Here is the code I use to allocate the array:
double a = 0.5;
int width = 2048;
int height = 2048;
double* data = new double[height * width];
double** image = new double*[height];
for (int i = 0; i < height; i++) {
image[i] = data + i * width;
}
Version 1 times the following loop:
for (int iter = 0; iter < 100; iter++) {
for(int w=0; w<width; w++) {
for(int h=1; h<height; h++) {
image[h][w] = (1-a)*image[h][w] + a*image[h-1][w];
}
}
}
Version 2 loop:
for (int iter = 0; iter < 100; iter++) {
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
}
Compiled with g++
4.4.3 with -O3
and run on a Xeon box of some description (64-bit Ubuntu).
If you are still 100% sure that you're seeing the opposite effect, there must be something fundamentally different about what you're doing compared to what I'm doing. It might help if you told us the dimensions of your image and how exactly it gets allocated (in order to help establish the memory layout).
aix is right about locality of reference. To be more explicit, this is because of memory hierarchy.
When you first access an element, it's probably a cache miss. An entire cache line is loaded, then the read/write happens.
Depending on which direction you traverse the array in, the next access will either be at location i+1 or i+N. i+1 is likely to be in the same cache line but i+N will generally be in another cache line, demanding another large fetch.
For small N, the whole thing ends up in cache and it doesn't matter much about direction. For suitably large N, not all of the array can fit into the fastest (and smallest) part of the cache, so the cache line containing element i may get dropped before you access i+M*N and have to be reloaded before accessing i+1.
To make it as fast as possible you have to get particular about the CPU architecture. Some are more alignment-sensitive than others. Some prefer you to touch each cache line once (up to capacity) and then do the copying afterwards. Timeslicing and processor sharing mess things up, of course.
精彩评论