Any code tips for speeding up random reads from a Java FileChannel?

I have a large (3GB) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.

I create the FileChannel like this...

f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();

I then use a private ByteBuffer the size of a double to read from it

private ByteBuffer _double_bb = ByteBuffer.allocate(8);

and my reading code looks like this

public double GetValue(long lRow, long lCol) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long position = idx * BLOCK_SIZE;
    double d = 0;
    try 
    {
        _double_bb.position(0);
        _ioChannel.read(_double_bb, position);
        d = _double_bb.getDouble(0);
    } 

    ...snip...

    return d;
}

and I write to it like this...

public void SetValue(long lRow, long lCol, double d) 
{
    long idx = TriangularMatrix.CalcIndex(lRow, lCol);
    long offset = idx * BLOCK_SIZE;
    try 
    {
        _double_bb.putDouble(0, d);
        _double_bb.position(0);
        _ioChannel.write(_double_bb, offset);
    } 

    ...snip...

}

The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set of reads that I feel are necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.

So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.

Thanks in advance


As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.

This is more important than anything else you can do, because accessing random spots on a hard disk is by far the slowest thing a modern PC does: a seek takes about 10,000 times longer than anything else.

So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
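One way to get that locality without changing the data layout is to batch the lookups the next pass will need and issue them in ascending file order. A rough sketch, assuming the algorithm can tolerate reordering and that the element indices come from the asker's TriangularMatrix.CalcIndex (BatchedReads is a made-up name):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;

public class BatchedReads {
    // Sort the requested element indices so the positioned reads sweep
    // forward through the file instead of seeking randomly.
    // Results come back in sorted order; the caller must map them back.
    public static double[] readSorted(FileChannel channel, long[] indices) throws IOException {
        long[] sorted = indices.clone();
        Arrays.sort(sorted);
        double[] out = new double[sorted.length];
        ByteBuffer buf = ByteBuffer.allocate(8);
        for (int i = 0; i < sorted.length; i++) {
            buf.clear();
            channel.read(buf, sorted[i] * 8L);   // ascending offsets, near-sequential on disk
            out[i] = buf.getDouble(0);
        }
        return out;
    }
}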

Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.


Instead of reading into a ByteBuffer, I would use file mapping, see: FileChannel.map().

Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, in image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).

You might transpose this idea to your algorithm (if it applies): you map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on this portion that's just been mapped.
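Something along these lines, as a non-authoritative sketch: the class name is made up, the 8-byte entry size and TriangularMatrix.CalcIndex come from the question, and since a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes, the 3GB file is covered by several 1GB mappings.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class RegionMappedMatrix {
    private static final long REGION_BYTES = 1L << 30;   // 1GB per mapping, a multiple of 8

    private final MappedByteBuffer[] regions;

    public RegionMappedMatrix(FileChannel channel) throws IOException {
        long size = channel.size();
        int count = (int) ((size + REGION_BYTES - 1) / REGION_BYTES);
        regions = new MappedByteBuffer[count];
        for (int i = 0; i < count; i++) {
            long start = i * REGION_BYTES;
            regions[i] = channel.map(FileChannel.MapMode.READ_WRITE, start,
                                     Math.min(REGION_BYTES, size - start));
        }
    }

    public double getValue(long lRow, long lCol) {
        long offset = TriangularMatrix.CalcIndex(lRow, lCol) * 8;   // 8 = BLOCK_SIZE in the question
        return regions[(int) (offset / REGION_BYTES)].getDouble((int) (offset % REGION_BYTES));
    }

    public void setValue(long lRow, long lCol, double d) {
        long offset = TriangularMatrix.CalcIndex(lRow, lCol) * 8;
        regions[(int) (offset / REGION_BYTES)].putDouble((int) (offset % REGION_BYTES), d);
    }
}

Reads that hit the OS page cache then avoid a system call entirely, and dirty pages are written back lazily (call force() on a region if you need the data on disk at a particular point).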


Presumably if we can reduce the number of reads then things will go more quickly.

3GB isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.

Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads check the cache first.
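A minimal sketch of such a cache; the page and cache sizes are arbitrary, and it only covers reads (writes would have to update or invalidate the cached page as well):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.LinkedHashMap;
import java.util.Map;

public class PageCache {
    private static final int PAGE_DOUBLES = 1024;   // 8KB per page
    private static final int MAX_PAGES = 65536;     // roughly 512MB of cached pages

    private final FileChannel channel;
    private final Map<Long, ByteBuffer> pages =
        new LinkedHashMap<Long, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> eldest) {
                return size() > MAX_PAGES;           // evict the least recently used page
            }
        };

    public PageCache(FileChannel channel) {
        this.channel = channel;
    }

    public double getValue(long idx) throws IOException {
        long page = idx / PAGE_DOUBLES;
        ByteBuffer buf = pages.get(page);
        if (buf == null) {
            buf = ByteBuffer.allocate(PAGE_DOUBLES * 8);
            channel.read(buf, page * PAGE_DOUBLES * 8L);   // one read per page, not per value
            pages.put(page, buf);
        }
        return buf.getDouble((int) (idx % PAGE_DOUBLES) * 8);
    }
}

The win comes from turning many 8-byte reads in the same neighbourhood into one page-sized read, so it only pays off if your accesses have some locality.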

Or, if you have the capacity, read the whole thing into memory at the start of processing.
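If the heap is big enough (roughly -Xmx4g for this 3GB file), loading everything up front could look like this sketch (class and method names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class InMemoryMatrix {
    // Reads the whole file into a double[]; 3GB of doubles is about 375 million entries.
    public static double[] loadAll(String filename) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(filename, "r");
             FileChannel channel = file.getChannel()) {
            double[] values = new double[(int) (channel.size() / 8)];
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);   // read in 1MB chunks
            int next = 0;
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.remaining() >= 8) {
                    values[next++] = buf.getDouble();
                }
                buf.compact();   // keep any partial trailing double for the next read
            }
            return values;
        }
    }
}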


  1. Accessing data byte-by-byte always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns); see the sketch after this list.

  2. How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
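For point 1, a hedged sketch of reading a contiguous run of values (for example one row of the triangular matrix, assuming rows are stored contiguously) in a single positioned read rather than one read per double:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class BulkReader {
    // Reads 'count' consecutive doubles starting at element index 'firstIdx'.
    public static double[] readRun(FileChannel channel, long firstIdx, int count) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(count * 8);
        long base = firstIdx * 8L;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, base + buf.position());
            if (n < 0) {
                break;                               // reached end of file
            }
        }
        buf.flip();
        double[] out = new double[buf.remaining() / 8];
        buf.asDoubleBuffer().get(out);
        return out;
    }
}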

Maybe this article helps you ...


You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.

The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.
