Java - Read text file by chunks

I want to read a log file in chunks so I can process it with multiple threads. The application will run in a server-side environment with multiple hard disks. After splitting the file into chunks, the app will process each chunk line by line.

I've managed to read the file line by line with a BufferedReader, and I can make chunks of my file with RandomAccessFile in combination with MappedByteBuffer, but combining these two isn't easy.

The problem is that a chunk boundary cuts through the middle of a line, so I never get the whole last line of a chunk and processing that last log line is impossible. I'm trying to find a way to cut my file into variable-length chunks that respect line endings.
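Simplified, the chunking side of what I have looks something like this (the file name and chunk count are just placeholders); each mapped buffer ends at a fixed byte offset, which is exactly where it slices through a line:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkSketch {
    public static void main(String[] args) throws IOException {
        int chunks = 4; // placeholder
        try (RandomAccessFile raf = new RandomAccessFile("your.log", "r");
             FileChannel channel = raf.getChannel()) {
            long chunkSize = channel.size() / chunks;
            for (int i = 0; i < chunks; i++) {
                long start = i * chunkSize;
                long size = (i == chunks - 1) ? channel.size() - start : chunkSize;
                // fixed-size mapping: the chunk boundary almost always falls mid-line
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
                // ... hand the buffer to a worker that reads it line by line
            }
        }
    }
}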

Does anyone have code for doing this?


You could find offsets in the file that fall on line boundaries before you start processing the chunks. For chunk i, start at the offset i * fileLength / chunks and scan forward until you find a line boundary. Then feed those offsets into your multi-threaded file processor. Here's a complete example that uses the number of available processors as the number of chunks:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadFileByChunks {
    public static void main(String[] args) throws IOException {
        int chunks = Runtime.getRuntime().availableProcessors();
        long[] offsets = new long[chunks];
        File file = new File("your.file");

        // determine line boundaries for number of chunks
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        for (int i = 1; i < chunks; i++) {
            raf.seek(i * file.length() / chunks);

            // scan forward to the next line break (or end of file)
            while (true) {
                int read = raf.read();
                if (read == '\n' || read == -1) {
                    break;
                }
            }

            offsets[i] = raf.getFilePointer();
        }
        raf.close();

        // process each chunk using a thread for each one
        ExecutorService service = Executors.newFixedThreadPool(chunks);
        for (int i = 0; i < chunks; i++) {
            long start = offsets[i];
            long end = i < chunks - 1 ? offsets[i + 1] : file.length();
            service.execute(new FileProcessor(file, start, end));
        }
        service.shutdown();
    }

    static class FileProcessor implements Runnable {
        private final File file;
        private final long start;
        private final long end;

        public FileProcessor(File file, long start, long end) {
            this.file = file;
            this.start = start;
            this.end = end;
        }

        public void run() {
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(start);

                while (raf.getFilePointer() < end) {
                    // note: readLine() decodes bytes as ISO-8859-1
                    String line = raf.readLine();
                    if (line == null) {
                        break; // end of file
                    }

                    // do what you need per line here
                    System.out.println(line);
                }
            } catch (IOException e) {
                // deal with exception
            }
        }
    }
}


You need to let your chunks overlap. If no line is longer than one block, then an overlap of one block is enough. Are you sure you need a multithreaded version? Is the performance of GNU grep not good enough?
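Here's a minimal sketch of the overlap idea, assuming a fixed block size, '\n' line endings, and that no line exceeds one block (BLOCK_SIZE, the file name, and handleLine are placeholders). Each worker processes only the lines that start inside its block, but is allowed to read past the block's end to finish the last line, so every line is handled exactly once:

import java.io.IOException;
import java.io.RandomAccessFile;

public class OverlapChunkReader {
    // Placeholder block size; the scheme assumes no line is longer than one block.
    static final long BLOCK_SIZE = 1 << 20;

    // Process every line that *starts* inside [start, end); reading may run
    // past end so the last line is always completed.
    static void processBlock(RandomAccessFile raf, long start, long end) throws IOException {
        if (start == 0) {
            raf.seek(0);
        } else {
            raf.seek(start - 1);
            if (raf.read() != '\n') {
                raf.readLine(); // landed mid-line; the previous block owns this line
            }
        }
        while (raf.getFilePointer() < end) {
            String line = raf.readLine();
            if (line == null) {
                break; // end of file
            }
            handleLine(line);
        }
    }

    static void handleLine(String line) {
        System.out.println(line); // placeholder for the real per-line work
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("your.log", "r")) {
            for (long start = 0; start < raf.length(); start += BLOCK_SIZE) {
                long end = Math.min(start + BLOCK_SIZE, raf.length());
                // a threaded version would give each block its own RandomAccessFile
                processBlock(raf, start, end);
            }
        }
    }
}

The check of the byte just before start matters: if that byte is a newline, the block begins exactly at a line boundary and nothing should be skipped; skipping unconditionally would silently drop that line.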

GNU grep has already solved the problem of lines that cross a chunk border. If the GPL doesn't bother you, you can probably borrow ideas and code from it. It is a very efficient single-threaded implementation.
