Reverse massive text file in Java

Developer https://www.devze.com 2022-12-28 15:29 Source: Internet
What would be the best approach to reverse a large text file that is uploaded asynchronously to a servlet that reverses this file in a scalable and efficient way?

  • the text file can be massive (gigabytes long)
  • you can assume a multiple-server/clustered environment, so the work can be done in a distributed manner
  • open source libraries are welcome

I was thinking of using Java NIO to treat the file as an array on disk (so that I don't have to hold the file as a string buffer in memory). I am also thinking of using MapReduce to break up the file and process it on separate machines.


If it is uploaded to you and you can get the length at the beginning, you could just create an empty full-sized file up front and write to it using seek, starting from the back and working your way to the front.

You'd probably want to define a block size (like 1K?) and reverse that much in memory before writing it out to the file.
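The idea above can be sketched roughly as follows. This is only a minimal illustration, assuming the total length is known up front and that the text uses a single-byte encoding such as ASCII (a multi-byte encoding would have its characters scrambled by a plain byte reversal). The class name `BlockReverser`, the method names, and the 1K `BLOCK_SIZE` are hypothetical, and `readNBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BlockReverser {

    // Hypothetical block size; the answer above suggests something like 1K.
    static final int BLOCK_SIZE = 1024;

    /** Streams exactly `length` bytes from `in` into `target`, reversed byte by byte. */
    static void reverseInto(InputStream in, long length, Path target) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(target.toFile(), "rw")) {
            out.setLength(length);          // create the full-sized file up front
            byte[] block = new byte[BLOCK_SIZE];
            long remaining = length;        // also the end offset of the next block to write
            while (remaining > 0) {
                int want = (int) Math.min(block.length, remaining);
                int n = in.readNBytes(block, 0, want);
                if (n < want) {
                    throw new EOFException("stream ended before the declared length");
                }
                reverseInPlace(block, n);   // reverse this block in memory
                remaining -= n;
                out.seek(remaining);        // work from the back towards the front
                out.write(block, 0, n);
            }
        }
    }

    /** Reverses the first `len` bytes of `a` in place. */
    static void reverseInPlace(byte[] a, int len) {
        for (int i = 0, j = len - 1; i < j; i++, j--) {
            byte tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
        Path target = Files.createTempFile("reversed", ".txt");
        reverseInto(new ByteArrayInputStream(data), data.length, target);
        // prints "dlrow olleh"
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.US_ASCII));
    }
}
```

Only one block is held in memory at a time, so the memory footprint stays constant regardless of the file size.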


That's a pretty tough task. If you can ensure that the HTTP Content-Length and Content-Type headers are present in the upload request (or in the multipart body when it's a multipart/form-data request), then it becomes an easy job with the help of RandomAccessFile. The content length is required so that you know how long the output file will be and can write each character at its final position. The character encoding (usually present as the charset attribute of the Content-Type header) is required to know how many bytes each character occupies, because RandomAccessFile is byte based and an encoding such as UTF-8 is variable length.

Here's a kickoff example (leaving obvious exception handling aside):

package com.stackoverflow.q2725897;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Test {

    public static void main(String... args) throws Exception {

        // Stub input. You need to gather it yourself from your sources.
        File file = new File("/file.txt");
        long length = file.length(); // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        String encoding = "UTF-8"; // Get it from HTTP request header using file upload API in question (Commons FileUpload?).
        InputStream content = new FileInputStream(file); // Get it from HTTP request body using file upload API in question (Commons FileUpload?).

        // Now the real job.
        Reader input = new InputStreamReader(content, encoding);
        RandomAccessFile output = new RandomAccessFile(new File("/filereversed.txt"), "rwd");
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();

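        for (int data; (data = input.read()) != -1;) {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(new char[] { (char) data }));
            length -= bytes.limit();
            output.seek(length);
            // Write only the encoded bytes; the backing array may be larger than limit().
            output.write(bytes.array(), bytes.arrayOffset(), bytes.limit());
        }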

        // Should actually be done in finally.
        input.close();
        output.close();
    }

}

If those headers are not present (Content-Length in particular), then you'll need to store the upload on disk first until the end of the stream, and then re-read and reverse it the same way with the help of RandomAccessFile.

Update: it is actually tougher than it looks. Is the character encoding of the input always guaranteed to be the same? If so, which one is it? Also, what would you like to do with, for example, surrogate characters and newlines? The above example doesn't handle those correctly, but it at least gives the basic idea.
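One possible way to deal with the surrogate and newline cases mentioned in the update is to treat a surrogate pair, or a "\r\n" sequence, as a single unit that is written out unreversed. The sketch below is only an illustration under several assumptions: the input is UTF-8, the given byte length matches the UTF-8 size of the decoded text exactly (no byte order mark, no malformed input), and keeping "\r\n" intact is what you want. Combining characters are not handled. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CodePointReverser {

    /**
     * Writes the text from `in` to `target` in reverse order, keeping surrogate
     * pairs and "\r\n" sequences intact. `byteLength` must be the UTF-8 size of
     * the input; if it is too small, seek() fails with an IOException.
     */
    static void reverseUtf8(Reader in, long byteLength, Path target) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        try (RandomAccessFile out = new RandomAccessFile(target.toFile(), "rw")) {
            out.setLength(byteLength);
            long pos = byteLength;
            for (int c; (c = reader.read()) != -1;) {
                String unit = String.valueOf((char) c);
                if (Character.isHighSurrogate((char) c) || c == '\r') {
                    // Look one char ahead to see whether it completes the unit.
                    int next = reader.read();
                    boolean completes = next != -1
                            && (c == '\r' ? next == '\n' : Character.isLowSurrogate((char) next));
                    if (completes) {
                        unit += (char) next;
                    } else if (next != -1) {
                        reader.unread(next);   // not part of this unit; put it back
                    }
                }
                byte[] bytes = unit.getBytes(StandardCharsets.UTF_8);
                pos -= bytes.length;
                out.seek(pos);
                out.write(bytes);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "ab\r\ncd";
        Path target = Files.createTempFile("reversed", ".txt");
        long size = text.getBytes(StandardCharsets.UTF_8).length;
        reverseUtf8(new StringReader(text), size, target);
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

For real uploads the reader should be buffered (for example an InputStreamReader wrapped in a BufferedReader), since this sketch reads one char at a time.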


Here is my way of reversing a file without holding it in memory. It reverses the order of the lines rather than the characters within each line.

import java.io.*;
import java.nio.charset.StandardCharsets;

public static void createReverseFile(String filePathToBeReversed) {
    String fileName = filePathToBeReversed.split("/")[filePathToBeReversed.split("/").length - 1];
    try {
        File reversedFile = new File(filePathToBeReversed.substring(0, filePathToBeReversed.lastIndexOf("/") + 1) + "reverse" + fileName.substring(0, 1).toUpperCase() + fileName.substring(1));
        reversedFile.delete();
        reversedFile.createNewFile();
        RandomAccessFile raf = new RandomAccessFile(reversedFile, "rw");
        long rafPointer = new File(filePathToBeReversed).length();
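        // Read as UTF-8 explicitly so that decoding matches the encoding used when writing.
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(filePathToBeReversed), StandardCharsets.UTF_8));
        int lineCount = 0;
        for (String line;(line = br.readLine()) != null;) {
            System.out.println("Reversing line " + lineCount++);
            // Assumes every line in the source ends with "\r\n"; otherwise the offsets drift.
            byte[] bytes = (line + "\r\n").getBytes(StandardCharsets.UTF_8);
            // Step back by the encoded byte count, not the char count.
            raf.seek(rafPointer -= bytes.length);
            System.out.println(rafPointer);
            raf.write(bytes);
        }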
        raf.close();
        br.close();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}


Save it in manageable chunks to disk as they come in, and then read the chunks backward when needed and present the content backwards.

Would 1 MB be a reasonable chunk size, given the amount of memory available to a normal Java application these days?
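A rough sketch of the chunk approach could look like this. It assumes a single-byte encoding (a plain byte reversal would break multi-byte characters), and the class name, method names, and "chunk-N" file naming are all hypothetical. `readNBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkedReverser {

    // 1 MB, the chunk size floated above.
    static final int CHUNK_SIZE = 1 << 20;

    /** Saves the incoming stream as numbered chunk files in `dir`, in arrival order. */
    static List<Path> saveChunks(InputStream in, Path dir, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.readNBytes(buf, 0, buf.length)) > 0) {
            Path chunk = dir.resolve("chunk-" + chunks.size());
            Files.write(chunk, Arrays.copyOf(buf, n));
            chunks.add(chunk);
        }
        return chunks;
    }

    /** Reads the chunks from last to first and writes each one byte-reversed to `out`. */
    static void writeReversed(List<Path> chunks, OutputStream out) throws IOException {
        for (int i = chunks.size() - 1; i >= 0; i--) {
            byte[] data = Files.readAllBytes(chunks.get(i));
            for (int lo = 0, hi = data.length - 1; lo < hi; lo++, hi--) {
                byte tmp = data[lo];
                data[lo] = data[hi];
                data[hi] = tmp;
            }
            out.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunks");
        byte[] input = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        // A tiny chunk size of 4 just to show several chunks in the demo.
        List<Path> chunks = saveChunks(new ByteArrayInputStream(input), dir, 4);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeReversed(chunks, out);
        // prints "jihgfedcba"
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

This variant does not need the total length up front, at the cost of writing the data to disk once before it can be served in reverse.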


In the MapReduce paradigm, the file can be broken into small partitions. Each partition can be loaded into a collection, which is easy to reverse, and in the reduce phase the reversed outputs can be merged back together (in reverse partition order). For example, in Spark with Scala the code would look something like this:

val content = sc.textFile(textfile,numpartitioner)
val op = content.mapPartitions(partitioner, true)

def partitioner(content: Iterator[String]): Iterator[String] = {
    val reverse = content.map { x => x.reverse }
    val reverseContent = reverse.toList.reverse
    reverseContent.toIterator
}