Most of you will realize this, but to me it came as a bit of a surprise: it's way faster to sort (for example) 96 files of 4 MB each than 6 files of 64 MB using mergesort (holding the total amount of data constant). I stumbled upon this finding by accident. This raises the question: what is the optimal input file size for a mergesort?
I assume there will be a curvilinear relationship between sorting time (y axis) and number of files (x axis). Is there an algorithm for this, is it more of a rule of thumb, or is it just a matter of trying a few different file sizes? Obvious factors that will impact this are:

* maximum number of files the OS can open simultaneously
* read/write speed of the hard disk

Any references are welcome!
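For reference, here is a minimal sketch of the kind of timing experiment I mean; the chunk sizes, paths, and helper names are only illustrative:

```python
# Sketch: compare external-mergesort time for many small runs vs. few large
# runs, keeping the total data volume constant.
import heapq
import os
import random
import tempfile
import time

def make_runs(values, run_len, tmpdir):
    """Split the values into sorted runs of run_len integers, one file per run."""
    paths = []
    for i in range(0, len(values), run_len):
        chunk = sorted(values[i:i + run_len])
        path = os.path.join(tmpdir, f"run_{i // run_len}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        paths.append(path)
    return paths

def merge_runs(paths, out_path):
    """K-way merge of the sorted run files into a single sorted output file."""
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            merged = heapq.merge(*(map(int, f) for f in files))
            out.writelines(f"{v}\n" for v in merged)
    finally:
        for f in files:
            f.close()

if __name__ == "__main__":
    total = 1_000_000                          # same total amount of data each time
    values = [random.randrange(10**9) for _ in range(total)]
    for run_len in (total // 96, total // 6):  # "96 small files" vs. "6 large files"
        with tempfile.TemporaryDirectory() as tmpdir:
            start = time.perf_counter()
            runs = make_runs(values, run_len, tmpdir)
            merge_runs(runs, os.path.join(tmpdir, "sorted.txt"))
            print(f"{len(runs):3d} runs of ~{run_len} values: "
                  f"{time.perf_counter() - start:.2f}s")
```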
If your sorting involves moving files around, then the conventional measures of the "fastest" sort algorithm don't really apply. When the elements being reordered are whole files, the fastest approach is the one that minimizes the number of file writes.
Selection sort can be used and comes very close to the minimum number of swaps possible, but even so, in the worst case each file has to be written twice: once when it is swapped out of the way to make room for the file that belongs in its slot, and once when it is swapped into its final place when its turn comes.
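To make that concrete, here is a small sketch that counts element writes during a selection sort; it works on an in-memory list rather than files, and the write counter is my own addition:

```python
# Sketch: selection sort, counting element writes. Each swap writes two slots,
# so in the worst case every item ends up being written about twice.
def selection_sort(items):
    a = list(items)
    writes = 0
    for i in range(len(a) - 1):
        # Find the index of the smallest remaining item...
        m = min(range(i, len(a)), key=a.__getitem__)
        if m != i:
            # ...and swap it into position i (two writes).
            a[i], a[m] = a[m], a[i]
            writes += 2
    return a, writes

print(selection_sort([4, 3, 2, 1]))   # ([1, 2, 3, 4], 4)
```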
There is an algorithm which performs at most n+1 assignments. A 'swap' (which is what most sorting algorithms use) involves three assignments via a temporary variable. This works much like a selection sort, but without actually swapping anything: either write each selected item out to new storage in sorted order, or record the sort order in memory first and then reorganize the files in the same storage space afterwards (defragmentation style). This is about as little data copying as you can get, which makes it ideal when copying items is expensive, as it is when sorting data on disk.
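As a concrete illustration of the "record the order, then reorganize" variant, here is a sketch that rotates items along permutation cycles so each item is moved into its final slot once, plus one temporary save per cycle. It operates on an in-memory list (the function name and the assignment counter are mine), but the same idea applies to moving files:

```python
# Sketch: compute the target order first, then move items along permutation
# cycles. A single cycle of length n costs n+1 assignments, far fewer than
# swap-based sorting.
def sort_by_cycles(items):
    a = list(items)
    n = len(a)
    order = sorted(range(n), key=a.__getitem__)   # order[k] = index of the k-th smallest item
    assignments = 0
    visited = [False] * n
    for start in range(n):
        if visited[start] or order[start] == start:
            visited[start] = True
            continue
        temp = a[start]            # save the item being displaced (one assignment per cycle)
        assignments += 1
        j = start
        while True:
            visited[j] = True
            k = order[j]           # the item that belongs at position j currently sits at k
            if k == start:
                a[j] = temp        # close the cycle with the saved item
                assignments += 1
                break
            a[j] = a[k]            # move it straight into place (one assignment)
            assignments += 1
            j = k
    return a, assignments

print(sort_by_cycles([30, 10, 40, 20]))   # ([10, 20, 30, 40], 5)
```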