I am writing an application that processes a huge number of integers from a binary file (up to 50 MB). I need it to run as quickly as possible, and the main performance issue is disk access time. Since I make a large number of reads from the disk, reducing read time should improve the performance of the app in general.
Up until now I thought that the fewer blocks I split my file into (i.e. the fewer reads I make and the larger each read is), the faster my app should work. This is because an HDD is very slow at seeking, i.e. locating the beginning of a block, due to its mechanical nature. However, once it has located the beginning of the block you asked it to read, it should perform the actual read fairly quickly.
Well, that was up until I ran this test:
(Old test removed; it had issues due to HDD caching.)
NEW TEST (the HDD cache doesn't help here, since the file is too big (1 GB) and I access random locations within it):
int mega = 1024 * 1024;
int giga = 1024 * 1024 * 1024;
byte[] bigBlock = new byte[mega];
int hundredKilo = mega / 10;
byte[][] smallBlocks = new byte[10][hundredKilo];
String location = "C:\\Users\\Vladimir\\Downloads\\boom.avi";
RandomAccessFile raf;
FileInputStream f;
long start;
long end;
int position;
java.util.Random rand = new java.util.Random();
int bigBufferTotalReadTime = 0;
int smallBufferTotalReadTime = 0;
for (int j = 0; j < 100; j++)
{
    position = rand.nextInt(giga);
    raf = new RandomAccessFile(location, "r");
    raf.seek((long) position);
    f = new FileInputStream(raf.getFD());
    start = System.currentTimeMillis();
    f.read(bigBlock);
    end = System.currentTimeMillis();
    bigBufferTotalReadTime += end - start;
    f.close();
}
for (int j = 0; j < 100; j++)
{
    position = rand.nextInt(giga);
    raf = new RandomAccessFile(location, "r");
    raf.seek((long) position);
    f = new FileInputStream(raf.getFD());
    start = System.currentTimeMillis();
    for (int i = 0; i < 10; i++)
    {
        f.read(smallBlocks[i]);
    }
    end = System.currentTimeMillis();
    smallBufferTotalReadTime += end - start;
    f.close();
}
System.out.println("Average performance of small buffer: " + (smallBufferTotalReadTime / 100));
System.out.println("Average performance of big buffer: " + (bigBufferTotalReadTime / 100));
RESULTS: average for the small buffer: 35 ms; average for the large buffer: 40 ms?! (Tried on Linux and Windows. In both cases the larger block size results in a longer read time. Why?)
After running this test many times I have realised that, for some reason, reading one big block takes longer on average than reading 10 smaller blocks sequentially. I thought it might be a result of Windows being too smart and trying to optimize something in its file system, so I ran the same code on Linux and, to my surprise, got the same result.
I have no clue why this is happening. Could anyone please give me a hint? Also, what would be the best block size in this case?
Kind Regards
After you read the data the first time, the data will be in disk cache. The second read should be much faster. You need to run the test you think is faster first. ;)
If you have 50 MB of memory, you should be able to read the entire file at once.
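A minimal sketch of reading the whole file in one go, assuming the file is a flat sequence of 4-byte big-endian integers (the question does not state the layout) and that it fits in the heap. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class WholeFileInts {
    // Reads the entire file with a single call and decodes it as ints.
    // Assumption: 4-byte big-endian values. Trailing bytes that do not
    // form a complete int are ignored.
    static int[] readAll(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        IntBuffer view = ByteBuffer.wrap(raw)
                .order(ByteOrder.BIG_ENDIAN)
                .asIntBuffer();
        int[] values = new int[view.remaining()];
        view.get(values);
        return values;
    }

    public static void main(String[] args) throws IOException {
        int[] values = readAll(Paths.get(args[0]));
        System.out.println("Read " + values.length + " integers");
    }
}
```

Once the values are in an int[], all further processing runs from memory, so the block-size question only matters for that one read.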
package com.google.code.java.core.files;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Benchmark: times reading a 50 MB file with block sizes from 1 MB down to
// 512 bytes, once with java.io streams and once with an NIO FileChannel.
public class FileReadingMain {
    public static void main(String... args) throws IOException {
        // Create a 50 MB file of zeros to read back.
        File temp = File.createTempFile("deleteme", "zeros");
        temp.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(temp);
        fos.write(new byte[50 * 1024 * 1024]);
        fos.close();

        // Repeat the whole sweep three times so later passes run warmed up.
        for (int i = 0; i < 3; i++)
            for (int blockSize = 1024 * 1024; blockSize >= 512; blockSize /= 2) {
                readFileNIO(temp, blockSize);
                readFile(temp, blockSize);
            }
    }

    private static void readFile(File temp, int blockSize) throws IOException {
        long start = System.nanoTime();
        byte[] bytes = new byte[blockSize];
        int r;
        // Re-read the whole file until two seconds have passed.
        for (r = 0; System.nanoTime() - start < 2e9; r++) {
            FileInputStream fis = new FileInputStream(temp);
            while (fis.read(bytes) > 0) ;
            fis.close();
        }
        long time = System.nanoTime() - start;
        // Average time per full pass over the file, in milliseconds.
        System.out.printf("IO: Reading took %.3f ms using %,d byte blocks%n", time / r / 1e6, blockSize);
    }

    private static void readFileNIO(File temp, int blockSize) throws IOException {
        long start = System.nanoTime();
        ByteBuffer bytes = ByteBuffer.allocateDirect(blockSize);
        int r;
        // Re-read the whole file until two seconds have passed.
        for (r = 0; System.nanoTime() - start < 2e9; r++) {
            FileChannel fc = new FileInputStream(temp).getChannel();
            while (fc.read(bytes) > 0) {
                bytes.clear(); // discard the data and reuse the buffer
            }
            fc.close();
        }
        long time = System.nanoTime() - start;
        System.out.printf("NIO: Reading took %.3f ms using %,d byte blocks%n", time / r / 1e6, blockSize);
    }
}
On my laptop this prints:
NIO: Reading took 57.255 ms using 1,048,576 byte blocks
IO: Reading took 112.943 ms using 1,048,576 byte blocks
NIO: Reading took 48.860 ms using 524,288 byte blocks
IO: Reading took 78.002 ms using 524,288 byte blocks
NIO: Reading took 41.474 ms using 262,144 byte blocks
IO: Reading took 61.744 ms using 262,144 byte blocks
NIO: Reading took 41.336 ms using 131,072 byte blocks
IO: Reading took 56.264 ms using 131,072 byte blocks
NIO: Reading took 42.184 ms using 65,536 byte blocks
IO: Reading took 64.700 ms using 65,536 byte blocks
NIO: Reading took 41.595 ms using 32,768 byte blocks <= fastest for NIO
IO: Reading took 49.385 ms using 32,768 byte blocks <= fastest for IO
NIO: Reading took 49.676 ms using 16,384 byte blocks
IO: Reading took 59.731 ms using 16,384 byte blocks
NIO: Reading took 55.596 ms using 8,192 byte blocks
IO: Reading took 74.191 ms using 8,192 byte blocks
NIO: Reading took 77.148 ms using 4,096 byte blocks
IO: Reading took 84.943 ms using 4,096 byte blocks
NIO: Reading took 104.242 ms using 2,048 byte blocks
IO: Reading took 112.768 ms using 2,048 byte blocks
NIO: Reading took 177.214 ms using 1,024 byte blocks
IO: Reading took 185.006 ms using 1,024 byte blocks
NIO: Reading took 303.164 ms using 512 byte blocks
IO: Reading took 316.487 ms using 512 byte blocks
It appears that the optimal read size may be 32KB. Note: as the file is entirely in disk cache this may not be the optimal size for a file which is read from disk.
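As a rough sketch of how a 32 KB block size could be applied to the original problem, the class below streams integers through a buffered stream of that size and sums them. It assumes 4-byte big-endian integers (which is what `DataInputStream.readInt` decodes), and the summing merely stands in for whatever processing the application does. `BufferedIntSum` is a hypothetical name, and the buffer size comes from the cached-file run above, so it is worth measuring again on the real data.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

final class BufferedIntSum {
    // Fastest size in the run above; treat it as a starting point, not a rule.
    static final int BUFFER_SIZE = 32 * 1024;

    // Sums every complete 4-byte big-endian int in the file.
    static long sum(File file) throws IOException {
        long count = file.length() / 4; // ignore a trailing partial int
        long total = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file), BUFFER_SIZE))) {
            for (long i = 0; i < count; i++) {
                total += in.readInt(); // served from the 32 KB buffer
            }
        }
        return total;
    }
}
```

The underlying file is only touched once per 32 KB, however small the individual `readInt` calls are.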
As noted, your test is hopelessly compromised by reading the same data for each.
I could spew on, but you'll probably get more out of reading this article, then looking at this example of how to use FileChannel.
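For readers without the linked example to hand, here is a small sketch of the usual FileChannel read loop (fill, flip, drain, compact). It is not the example referred to above. It assumes 4-byte big-endian integers, and `ChannelIntSum` and its `blockSize` parameter are names invented for this illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelIntSum {
    // Sums every complete 4-byte big-endian int in the file, reading it
    // through a direct buffer of blockSize bytes.
    static long sum(Path file, int blockSize) throws IOException {
        if (blockSize < 4) {
            throw new IllegalArgumentException("blockSize must hold at least one int");
        }
        ByteBuffer buf = ByteBuffer.allocateDirect(blockSize); // big-endian by default
        long total = 0;
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            while (fc.read(buf) != -1) {
                buf.flip();                  // switch from filling to draining
                while (buf.remaining() >= 4) {
                    total += buf.getInt();
                }
                buf.compact();               // carry a partial int over to the next read
            }
        }
        return total;
    }
}
```

The `compact()` call is what lets the block size be anything of four bytes or more, including sizes that are not a multiple of four.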