I am working on optimizing an algorithm that we are preparing to port to a GPU using CUDA.
The I/O part reads from 3 different images, one row at a time, and this read sat right in the middle of the loop that runs the filter over the images. I decided to try pre-loading the values by moving the I/O into its own loop, dumping the data into arrays that hold the images and are used in the calculation.
Now the problem is that my application seems to run slower with the buffers fully loaded with data, and faster when it had to go out to disk for new data on every iteration.
What could be causing this? Would cache misses from the larger buffers really hurt performance that much? It's not a memory issue; with 24 GB on this machine there is plenty of RAM.
I'm not sure what else it could be, and I'm open to ideas.
@Derek provided the following additional information:
(Run time) ... "is over a minute, compared to 10 - 14 seconds before. I am not doing any specific threading, though I do have some OpenMP pragmas. Moving the I/O outside of the filter loop did not change any of those though. I am running CentOS 5.5. The image size is approx 72MB"
That is a huge difference in run time. Since OpenMP is used, we can assume there are multiple threads. Since you're only dealing with 72 MB of data, I can't see how the difference in I/O time could be that large. We can be sure the read time is smaller than your original 10-14 seconds, so unless there is a bug in that portion of the code, the extra time is in the filter section. The images are presumably binary? As @Satya suggested, profiling your code, or at least adding some timing printouts, may help identify where the problem lies.
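For example, a minimal sketch of the kind of timing printouts meant here, using omp_get_wtime() since OpenMP is already in use; read_images() and run_filter() are hypothetical stand-ins for your two stages, not your actual code:

#include <omp.h>
#include <stdio.h>

/* Hypothetical stand-ins for the poster's I/O and filter stages. */
static void read_images(void) { /* ... load all three images ...  */ }
static void run_filter(void)  { /* ... run the filter loop ...    */ }

int main(void)
{
    double t0 = omp_get_wtime();
    read_images();                      /* the new up-front I/O loop */
    double t1 = omp_get_wtime();
    run_filter();                       /* the filter section        */
    double t2 = omp_get_wtime();

    printf("I/O:    %.3f s\n", t1 - t0);
    printf("filter: %.3f s\n", t2 - t1);
    return 0;
}

Even two printouts like this will tell you immediately whether the extra minute is spent in the I/O or in the filter.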
The "advantage" of reading in the loop may be:
- The OS is giving you some parallelism because it is able to perform some of the I/O in parallel with your computation, e.g. reading ahead. You lose that parallelism when you read everything in advance, effectively blocking while reading.
- The read data is in the cache at the time that your filter is accessing the data. Cache misses can really kill performance if the processing is lightweight relative to the memory bandwidth. It's hard to believe this would make a significant difference in this use case because disk I/O is so much slower than memory.
Given your latest update, it does seem more likely we're dealing with #2. Something to watch out for, though, is the memory access pattern (across all threads): it is possible you are seeing cache thrashing because data that used to be adjacent in main memory is now further apart. This can have a large impact, because if you have many memory accesses and they are all cache misses, you always pay the cost of going further out in the memory hierarchy, which can be an order of magnitude slower.
A solution to this is to arrange your memory in stripes, e.g. n lines from the first image, followed by n lines from the second image, followed by n lines from the third image. IIRC this technique is called "striping". The exact stripe size depends on your CPU but it's something you can experiment with (or start with the same amount of data that used to be read in the inner loop if that's large enough).
E.g., reading one image file into its interleaved slots (image_index is the index of the file being read, 0 .. NUM_IMAGES-1):
/* Place stripe s of image image_index at slot (s * NUM_IMAGES + image_index),
   so the three images end up interleaved stripe by stripe. */
size_t stripe_number = 0;
size_t count;
do
{
    count = fread(striped_buffer
                      + (stripe_number * NUM_IMAGES + image_index) * STRIPE_SIZE,
                  1, STRIPE_SIZE, image_file);
    stripe_number++;
} while (count == STRIPE_SIZE);   /* stop on the short read at end of file */
Read one file at a time so you're not seeking back and forth on your drive.
Regardless, to maximize performance you probably want to look into using asynchronous/overlapped I/O to have your next bit of image data coming in during the time you are processing the previous bit.
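Since you're on CentOS, a minimal double-buffering sketch using POSIX AIO (link with -lrt) might look like the following; the file name, chunk size, and process_chunk() are placeholders, not your actual code:

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)                 /* 1 MB per request; worth tuning */

/* Hypothetical stand-in for one filter step over a chunk of image data. */
static void process_chunk(const char *buf, size_t n) { (void)buf; (void)n; }

int main(void)
{
    static char buf[2][CHUNK];
    int fd = open("image.raw", O_RDONLY);       /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = CHUNK;
    aio_read(&cb);                              /* kick off the first read */

    int cur = 0;
    for (;;) {
        const struct aiocb *const list[1] = { &cb };
        aio_suspend(list, 1, NULL);             /* wait for the read to land */
        ssize_t got = aio_return(&cb);
        if (got <= 0)
            break;

        /* Queue the next read into the other buffer... */
        int done = cur;
        cur ^= 1;
        off_t next = cb.aio_offset + got;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = next;
        aio_read(&cb);

        /* ...and filter the chunk that just arrived while that read runs. */
        process_chunk(buf[done], (size_t)got);
    }
    close(fd);
    return 0;
}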
If you're developing under Windows this can give you a start on doing overlapped I/O: http://msdn.microsoft.com/en-us/library/ms686358%28v=vs.85%29.aspx
Once you are doing your I/O in parallel you can figure out if your bottleneck is in the I/O or in the processing. There are different techniques for optimizing those.
Yes, you pull the image through the L2 cache twice: once when you load it from the file, and again when the filter reads it back from memory. You also spend time moving the data from the cache out to main memory in between.
As an option, you could try loading the data in parts of 2-8 MB (depending on your L2 cache size) and filtering each part while it is still in the cache.
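A minimal sketch of that idea, assuming a plain binary file; the file name and filter_chunk() are hypothetical stand-ins for your code:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BYTES (4u << 20)          /* 4 MB; tune to roughly L2 size */

/* Hypothetical stand-in for running the filter over one chunk. */
static void filter_chunk(const unsigned char *buf, size_t n) { (void)buf; (void)n; }

int main(void)
{
    FILE *f = fopen("image.raw", "rb"); /* hypothetical file name */
    if (!f) { perror("fopen"); return 1; }

    unsigned char *buf = malloc(CHUNK_BYTES);
    size_t n;
    /* Filter each chunk while it is still warm in the cache. */
    while ((n = fread(buf, 1, CHUNK_BYTES, f)) > 0)
        filter_chunk(buf, n);

    free(buf);
    fclose(f);
    return 0;
}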
In addition to @Guy's answer, I should mention memory-mapped files; they combine the best parts of both approaches. However, it should take about a second to read 70 MB, so the problem lies somewhere else.
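For reference, a minimal memory-mapping sketch on Linux (the file name is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("image.raw", O_RDONLY);       /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file; pages are faulted in on first access, so the
       kernel can overlap its read-ahead with the computation. */
    const unsigned char *img = mmap(NULL, st.st_size, PROT_READ,
                                    MAP_PRIVATE, fd, 0);
    if (img == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... run the filter directly over img[0 .. st.st_size - 1] ... */

    munmap((void *)img, st.st_size);
    close(fd);
    return 0;
}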
It could be caused by coherence traffic between the cores' caches. I don't know much about this, but if two threads have write access to the same memory page at the same time (or worse, to the same cache line), their caches have to be synchronized. When you read the whole image at once, all of your processing threads work on it at the same time; do they write their results to nearby memory addresses? When you read the images line by line, the threads spend some time waiting for I/O to complete, so the collisions won't happen as often.
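To illustrate the usual fix, here is a small OpenMP sketch that pads each thread's output so no two threads write to the same cache line; the sizes and names are assumptions, not taken from your code:

#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64
#define MAX_THREADS 64

/* Pad each thread's accumulator to a full cache line. If two threads
   wrote into the same line, the cores would keep re-synchronizing that
   line on every write (false sharing). */
struct padded_sum {
    double sum;
    char pad[CACHE_LINE - sizeof(double)];
};

int main(void)
{
    enum { N = 1 << 20 };
    static double data[N];
    static struct padded_sum acc[MAX_THREADS];

    #pragma omp parallel
    {
        int t = omp_get_thread_num();   /* assumes <= MAX_THREADS threads */
        #pragma omp for
        for (int i = 0; i < N; i++)
            acc[t].sum += data[i];      /* each thread stays on its own line */
    }

    double total = 0.0;
    for (int i = 0; i < MAX_THREADS; i++)
        total += acc[i].sum;
    printf("total = %f\n", total);
    return 0;
}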