开发者

Sorting a very large text file in Java

开发者 https://www.devze.com 2022-12-13 12:49 出处:网络
I have a large text file I need to sort in Java. The format is: word [tab] frequency [new line] The algorithm for sorting is:

I have a large text file I need to sort in Java. The format is:

word [tab] frequency [new line]

The algorithm for sorting is:

  • Read some of the file, filtering for purlely alphabetic words.
  • Once you have X number of alphabetic words, call Collections.sort and write the result to a file.
  • Repeat until you have finished reading the file.
  • Start reading two sorted files, comparing line by line for the word with higher frequency, and writing at the same time to a new file as to not load much into your memory
  • Repeat until all files are merged into one large file

Right now I've divided the large file into smaller ones (sorted by descending frequency) with 10,000 lines each. I know I need to somehow merge these files back together, but I'm not sure how to go about this.

I've created a LinkedList to keep track of all the files created. The algorithm says to compare each line in the two files, except I've tried a case where , say file1 = 8,6,5,3,1 and file2 = 9,8,8,8,8. Th开发者_Python百科en if I compare them line by line I would get file3 = 9,8,8,6,8,5,8,3,8,1 which is incorrectly sorted (they should be in decreasing order).

I think I'm misunderstanding some part of the algorithm. If someone could point out what I should do instead, I'd greatly appreciate it. Thanks.

edit: Yes this is an assignment. We aren't allowed to increase memory unfortunately :(


You have the right idea, but with a small error. As you read the lines from the 2 files, you shouldn't be outputting both lines, because the next line in the file with the greater number might still be greater than the first line in the file with the smaller number (as it is in your test case).

So, it's quite simply this:

Read a line from each file to start.
Then repeat this:
.The line with the highest value is written to a new file
.Read another line from that file only

This is the basic algorithm, but of course you have to allow for what happens when one of the files runs out (in which case you just read lines and output from the remaining file - whether this is a separate loop or part of the same loop is up to you - I would look at what the code looks like before making that decision).


If the file is too large to fit into memory, use a database. Something like MySQL might be too heavyweight, but there are embeddable databases you can use in java.

One of them is berkely DB which is a Key/value database system.

Apache Derby is a relational database system that lets you use SQL.

If you already know SQL, derby might be the easiest way to go. I haven't used it myself.

0

精彩评论

暂无评论...
验证码 换一张
取 消