
How to sort 100GB worth of strings

Given a hard drive with 120 GB, 100 GB of which are filled with strings of length 256, and 2 GB of RAM, how do I sort those strings most efficiently in Java? How long will it take?


A1: You probably want to implement some form of merge-sort.

A2: Longer than it would if you had 256GB RAM on your machine.

Edit: stung by criticism, I quote from Wikipedia's article on merge sort:

Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.

For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.


Here is how I'd do it:

Phase 1 is to split the 100 GB into 50 partitions of 2 GB, read each partition into memory, sort it using quicksort, and write it back out. You want the sorted partitions at the top end of the disc.

Phase 2 is to then merge the 50 sorted partitions. This is the tricky bit, because you don't have enough space on the disc to store both the partitions and the final sorted output. So:

  1. Do a 50-way merge to fill the first 20 GB at the bottom end of the disc.

  2. Slide the remaining data in the 50 partitions towards the top of the disc, to make another 20 GB of free space contiguous with the end of the first 20 GB.

  3. Repeat steps 1. and 2. until completed.

This does a lot of disc IO, but you can use your 2 GB of memory for buffering in the copying and merging steps, improving throughput by minimizing the number of disc seeks and doing large data transfers. A sketch of the two phases is below.
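For concreteness, here is a minimal Java sketch of the two phases, assuming the strings arrive newline-delimited in a single input file. The file names and chunk size are illustrative, and the disc-space shuffling of step 2 is omitted:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalMergeSort {
    // Illustrative chunk size: Java strings cost 2 bytes/char plus object
    // overhead, so stay well under the naive "2 GB / 256 bytes" figure.
    static final int CHUNK_LINES = 2_000_000;

    public static void main(String[] args) throws IOException {
        List<Path> runs = createSortedRuns(Paths.get("input.txt"));
        mergeRuns(runs, Paths.get("sorted.txt"));
    }

    // Phase 1: read a memory-sized chunk, sort it, write it out as a run.
    static List<Path> createSortedRuns(Path input) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_LINES) runs.add(writeRun(chunk, runs.size()));
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk, runs.size()));
        }
        return runs;
    }

    static Path writeRun(List<String> chunk, int id) throws IOException {
        Collections.sort(chunk);                 // in-memory sort of one partition
        Path run = Paths.get("run" + id + ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    // Phase 2: k-way merge driven by a priority queue holding each run's head line.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<RunHead> pq = new PriorityQueue<>();
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) pq.add(new RunHead(first, r));
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!pq.isEmpty()) {
                RunHead h = pq.poll();
                out.write(h.head);
                out.newLine();
                String next = h.reader.readLine();
                if (next != null) pq.add(new RunHead(next, h.reader));
                else h.reader.close();
            }
        }
    }

    static final class RunHead implements Comparable<RunHead> {
        final String head;
        final BufferedReader reader;
        RunHead(String head, BufferedReader reader) { this.head = head; this.reader = reader; }
        public int compareTo(RunHead o) { return head.compareTo(o.head); }
    }
}
```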

EDIT - @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release the disc space used by the partitions (phase 2, step 2) by simply truncating the partition files; see the snippet below.

The potential downsides of this are increased disc fragmentation and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
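A minimal illustration of the truncation trick, assuming each partition is a separate file of fixed-width 256-byte records sorted in reverse order (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PartitionTail {
    // Pop the last fixed-width record off a reverse-sorted partition file,
    // then truncate so the filesystem reclaims the space immediately.
    static byte[] popLastRecord(RandomAccessFile partition) throws IOException {
        final int REC = 256;
        long newLen = partition.length() - REC;
        byte[] rec = new byte[REC];
        partition.seek(newLen);
        partition.readFully(rec);
        partition.setLength(newLen);  // releases the tail back to the FS
        return rec;
    }
}
```

In practice you would read and truncate in large batches rather than one 256-byte record at a time, both to cut syscall overhead and to keep the transfers sequential.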

Finally, I'd like to point out that any theoretical predictions of the time taken by this algorithm (and others) are largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning, and benchmarking.


I am basically repeating Krystian's answer, but elaborating:

Yes, you need to do this more-or-less in place, since you have little RAM available. But naive in-place sorts would be a disaster here, purely because of the cost of moving strings around.

Rather than actually move strings around, just keep track of which strings should swap with which others and actually move them, once, at the end, to their final spot. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.

Then, any in-place algorithm like quicksort works pretty well.
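Here is a toy sketch of the idea, with a small in-memory String[] standing in for the 256-byte records on disc (in the real version each comparison would read two records by offset, and the array would have 419M entries). Note that this sketch builds order[k] = the index of the string that belongs at position k, which is the inverse of the array[i] mapping described above but carries the same information. Arrays.sort cannot take a comparator for a primitive int[], so the quicksort is hand-rolled:

```java
import java.util.*;

public class IndexSort {
    public static void main(String[] args) {
        // Toy stand-in for the strings; the real data would live in
        // fixed-width 256-byte records on disc, fetched by offset i * 256.
        String[] data = { "pear", "apple", "mango", "kiwi" };

        int[] order = new int[data.length];
        for (int i = 0; i < order.length; i++) order[i] = i;  // order[i] == i to start

        quicksort(data, order, 0, order.length - 1);
        System.out.println(Arrays.toString(order));  // [1, 3, 2, 0]
    }

    // Quicksort over the int index array: comparisons read the strings,
    // but swaps only exchange two 4-byte ints.
    static void quicksort(String[] data, int[] ix, int lo, int hi) {
        if (lo >= hi) return;
        String pivot = data[ix[(lo + hi) >>> 1]];
        int i = lo, j = hi;
        while (i <= j) {
            while (data[ix[i]].compareTo(pivot) < 0) i++;
            while (data[ix[j]].compareTo(pivot) > 0) j--;
            if (i <= j) { int t = ix[i]; ix[i] = ix[j]; ix[j] = t; i++; j--; }
        }
        quicksort(data, ix, lo, j);
        quicksort(data, ix, i, hi);
    }
}
```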

The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100 GB of data in reasonably sized writes. If the drive / controller / OS can sustain about 100 MB/sec for you, that's roughly 1000 seconds, or about 17 minutes.

But does it fit in memory? You have 100 GB of strings, each of which is 256 bytes. How many strings? 100 × 2^30 / 2^8, or about 419M strings. You need 419M ints, each 4 bytes, or about 1.7 GB. Voilà, it fits in your 2 GB.
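And a sketch of that final move, assuming the order[] form from the previous sketch: walking the permutation's cycles moves each record to its final spot exactly once. The String[] again stands in for the disc records:

```java
public class ApplyOrder {
    // Move each record exactly once by following the permutation's cycles;
    // needs one spare record's worth of space plus a visited flag per entry.
    static void applyOrder(String[] data, int[] order) {
        boolean[] done = new boolean[data.length];
        for (int start = 0; start < data.length; start++) {
            if (done[start]) continue;
            String held = data[start];        // record displaced at the cycle head
            int pos = start;
            while (order[pos] != start) {
                data[pos] = data[order[pos]]; // pull in the record that belongs here
                done[pos] = true;
                pos = order[pos];
            }
            data[pos] = held;                 // close the cycle
            done[pos] = true;
        }
    }

    public static void main(String[] args) {
        String[] data = { "pear", "apple", "mango", "kiwi" };
        int[] order = { 1, 3, 2, 0 };         // output of the previous sketch
        applyOrder(data, order);
        System.out.println(java.util.Arrays.toString(data)); // [apple, kiwi, mango, pear]
    }
}
```

One caveat: at 419M entries the done[] array costs another ~400 MB, which blows the 2 GB budget on top of the 1.7 GB of ints; a real version could steal the sign bit of the int array as the visited flag instead.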


Sounds like a task that calls for an external sorting method. Volume 3 of "The Art of Computer Programming" contains a section with an extensive discussion of external sorting methods.


I think you should use BogoSort. You might have to modify the algorithm a bit to allow for in-place sorting, but that shouldn't be too hard. :)


You should use a trie (a.k.a. a prefix tree): a tree-like structure that lets you easily walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory: you can build the trie as a tree of directories on your file system (obviously, not the one the data is coming from). A sketch of the idea follows.
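A minimal sketch of the directory-trie idea, assuming newline-delimited input, a reasonably even spread of prefixes, and hex-encoding each leading character so every directory name is filename-safe and sorts in the same order as the character codes (the paths and depth are illustrative):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class FsTrie {
    static final Path ROOT = Paths.get("trie");  // scratch dir on the spare 20 GB
    static final int DEPTH = 2;                  // one directory level per leading char

    public static void main(String[] args) throws IOException {
        // Pass 1: bucket each string into a directory path built from its prefix.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
                Path bucket = ROOT;
                for (int i = 0; i < DEPTH; i++) bucket = bucket.resolve(hex(s.charAt(i)));
                Files.createDirectories(bucket);
                Files.write(bucket.resolve("strings.txt"), List.of(s),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
        // Pass 2: walking the tree in lexicographic path order visits buckets in
        // string order; each bucket is small enough to finish sorting in memory.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("sorted.txt"));
             Stream<Path> leaves = Files.walk(ROOT).filter(Files::isRegularFile).sorted()) {
            for (Path leaf : (Iterable<Path>) leaves::iterator) {
                List<String> bucket = Files.readAllLines(leaf);
                Collections.sort(bucket);
                for (String s : bucket) { out.write(s); out.newLine(); }
            }
        }
    }

    // Fixed-width hex keeps directory-name order identical to char-code order.
    static String hex(char c) { return String.format("%04x", (int) c); }
}
```

Appending one line per string, as here, would be far too slow for ~400M strings; a real version would buffer writes per bucket, and might need a deeper tree if the prefixes are skewed.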


AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.
