Pre-sorting analysis algorithm?

It's a well-known issue with Quicksort that when the data set is in or almost in sort order, performance degrades horribly. In this case, Insertion Sort, which is normally very slow, is easily the best choice. The question is knowing when to use which.

Is there an algorithm available to run through a data set, apply a comparison factor, and return a report on how close the data set is to being in sort order? I prefer Delphi/Pascal, but I can read other languages if the example isn't overly complex.

As you'd expect, quite a lot of thought has gone into this. The median-of-three technique means that quicksort's worst-case behaviour doesn't occur for sorted data, but instead for less obvious cases.

Introsort is quite exciting, since it avoids quicksort's quadratic worst case altogether. Instead of your natural question, "how do I detect that the data is nearly-sorted", it in effect asks itself as it's going along, "is this taking too long?". If the answer is yes, it switches from quicksort to heapsort.
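To make the "is this taking too long?" idea concrete, here is a minimal sketch in Python. It is not any particular library's implementation: the depth budget of 2 * floor(log2(n)) is an assumed (though commonly quoted) choice, the function names are made up for illustration, and the pivot is deliberately the naive last element so that sorted input actually triggers the fallback.

```python
import heapq
import math


def introsort(items):
    """Sort a list in place: quicksort with a depth budget, heapsort as fallback."""
    if len(items) > 1:
        # Assumed depth budget: 2 * floor(log2(n)).
        depth_limit = 2 * int(math.log2(len(items)))
        _introsort(items, 0, len(items), depth_limit)
    return items


def _introsort(items, lo, hi, depth_limit):
    # Sorts the half-open range items[lo:hi].
    while hi - lo > 1:
        if depth_limit == 0:
            # "Is this taking too long?" Yes: finish this range with heapsort.
            _heapsort_range(items, lo, hi)
            return
        depth_limit -= 1
        p = _partition(items, lo, hi)
        # Recurse into the left part, keep looping on the right part.
        _introsort(items, lo, p, depth_limit)
        lo = p + 1


def _partition(items, lo, hi):
    # Lomuto partition with the last element as pivot (deliberately naive).
    pivot = items[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if items[i] < pivot:
            items[i], items[store] = items[store], items[i]
            store += 1
    items[store], items[hi - 1] = items[hi - 1], items[store]
    return store


def _heapsort_range(items, lo, hi):
    # Heapsort via the standard-library heap; uses a temporary copy for brevity.
    heap = items[lo:hi]
    heapq.heapify(heap)
    for i in range(lo, hi):
        items[i] = heapq.heappop(heap)
```

On already sorted input the naive pivot burns through the depth budget quickly, and the remaining range is handed to heapsort, so the total work stays O(N log N) without any pre-pass over the data.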

Timsort combines merge sort with insertion sort, and performs very well on sorted or reverse-sorted data, and on data that includes sorted or reverse-sorted subsets.
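If you do want a number for "how sorted is this", one option in the same spirit is to count runs, i.e. maximal stretches that are already ascending or descending. The sketch below is only an illustration of that scan, not Timsort itself; `count_runs` is a made-up name, and treating ties as part of an ascending run is an assumption.

```python
def count_runs(items):
    """Count maximal runs that are non-descending or strictly descending."""
    n = len(items)
    runs = 0
    i = 0
    while i < n:
        runs += 1
        j = i + 1
        if j < n and items[j] < items[i]:
            # Strictly descending run: extend while each item is smaller.
            while j < n and items[j] < items[j - 1]:
                j += 1
        else:
            # Non-descending run: extend while each item is not smaller.
            while j < n and items[j] >= items[j - 1]:
                j += 1
        i = j  # The next run starts where this one stopped.
    return runs
```

A result of 1 means the data is fully sorted or fully reversed; a result close to N/2 means there is essentially no existing order to exploit. The scan is a single linear pass.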

So probably the answer to your question is, "you don't need a pre-pass analysis, you need an adaptive sort algorithm".

There's also SmoothSort, which is apparently quite tricky to implement, but its running time varies between O(N log N) and O(N) depending on how sorted the data is to start with.

http://en.wikipedia.org/wiki/Smoothsort

Long tricky PDF: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796a.PDF

However, if your data is truly huge and you have to access it serially, mergesort is probably the best. It's always O(N log N) and it has excellent 'locality' properties.

I've not heard of any pre-sorting analysis, but my opinion is that if you are going to go through the data set to analyze it, then you are already cutting into your overall sorting time.

One possible solution is to take the first, last and middle elements of the current sort range (during the QuickSort operation) and choose the median of those three values as the pivot element.
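A small sketch of that pivot choice in Python, assuming `lo` and `hi` are inclusive bounds of the current range (the function name is made up for illustration):

```python
def median_of_three_index(items, lo, hi):
    """Return the index (lo, mid or hi) that holds the median of those three values."""
    mid = (lo + hi) // 2
    a, b, c = items[lo], items[mid], items[hi]
    if a <= b <= c or c <= b <= a:
        return mid
    if b <= a <= c or c <= a <= b:
        return lo
    return hi
```

For a range that is already sorted, this picks the middle position, which splits the range evenly instead of producing the lopsided split a first-element or last-element pivot would give.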

To fully analyze the data for the purpose of deciding which algorithm to use, you are going to do nearly the work of sorting. You could instead check the values at a small percentage of random but increasing indexes (i.e. analyze a small sample of the items).
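A rough sketch of that sampling idea in Python. The sample size of 32 and the function name are arbitrary assumptions, and a positive answer is only evidence, not proof, that the whole data set is in order.

```python
import random


def sample_looks_sorted(items, sample_size=32, rng=random):
    """Check whether a random sample, taken at increasing indexes, is in order."""
    if len(items) <= sample_size:
        indexes = range(len(items))
    else:
        # Pick distinct random positions, then visit them in increasing order.
        indexes = sorted(rng.sample(range(len(items)), sample_size))
    sample = [items[i] for i in indexes]
    return all(a <= b for a, b in zip(sample, sample[1:]))
```

The cost is proportional to the sample size rather than to N, at the price of missing disorder that happens to fall between the sampled positions.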

You would still have to run through all records to determine whether the list is sorted or not, so to improve performance, start with your first record and run through the rest until you either notice something not properly sorted, or reach the end of the list. If you find an out-of-order item, then only sort the items from that position to the end (since the beginning of the list is already sorted).

At each item in the second part, see if the item is less than the last element in the first part, and if so use an insertion sort into ONLY the first part. Otherwise Quicksort against all other items in the second part. This way the sort is optimized for the specific case.
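Here is one way the prefix scan could look in Python. Note that this is a simplification and not exactly the scheme described above: instead of inserting individual items into the first part, it sorts the whole second part (the built-in `sorted` stands in for Quicksort) and then merges the two sorted parts. The function names are made up for illustration.

```python
import heapq


def sorted_prefix_length(items):
    """Length of the longest non-descending prefix."""
    for i in range(1, len(items)):
        if items[i] < items[i - 1]:
            return i
    return len(items)


def sort_with_prefix(items):
    """Return a sorted list, reusing the prefix that is already in order."""
    k = sorted_prefix_length(items)
    if k == len(items):
        return list(items)  # Already in order: one linear pass, no sorting.
    tail = sorted(items[k:])  # Stand-in for quicksorting the second part.
    # Merge the two sorted parts in linear time.
    return list(heapq.merge(items[:k], tail))
```

When the list is fully sorted the cost is a single pass; when only the last few items are out of place, only those items get sorted before the merge.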

QuickSort being a problem only when the data set is huge and already mostly sorted, I would use the following heuristics (pending a full-blown solution):

Don't bother if data set size is below threshold.
If you have quick (indexed) access to the records (items), take a sample of 1 record in every N records and see if the sampled records are already sorted. That should be quick enough for a small sample, and you can then decide whether to use quick sort or not.
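The two heuristics could be sketched like this in Python. The threshold of 10,000 items, the step of 100 and the function name are placeholder assumptions to tune for your own data.

```python
def quicksort_looks_risky(items, size_threshold=10_000, step=100):
    """Return True if a large data set appears to be already in sort order."""
    if len(items) < size_threshold:
        return False  # Small data set: don't bother analysing it.
    sample = items[::step]  # 1 record in every `step` records.
    return all(a <= b for a, b in zip(sample, sample[1:]))
```

This only checks ascending order, as described above. It says nothing about reverse-sorted data, and it can be fooled by disorder that falls between the sampled records.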

To make a conceptual point that people haven't yet made: Quicksort is a common-sense divide-and-conquer algorithm with an obvious bug in rare cases. Suppose that you want to sort a stack of student papers. (Which I have to do with some regularity.) In the quicksort algorithm, you pick some paper, the pivot. Then divide the other papers according to whether they are before or after the pivot. Then repeat that with the two subpiles. What's the bug? The pivot could be a name that is near one end of the list instead of in the middle, so that it doesn't accomplish much to divide it into two piles.
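To see the "bug" in numbers, here is a small sketch that sorts with piles exactly as in the paper-stack description and counts comparisons. Always taking the first paper as the pivot is an assumption made for illustration, as is the function name; an explicit stack replaces recursion so that lopsided splits don't hit Python's recursion limit.

```python
def quicksort_comparisons(items):
    """Sort a copy using a first-element pivot; return (sorted_list, comparisons)."""
    comparisons = 0
    result = []
    stack = [("pile", list(items))]
    while stack:
        kind, value = stack.pop()
        if kind == "done":
            result.append(value)  # A pivot whose smaller items are all placed.
            continue
        if not value:
            continue
        pivot, rest = value[0], value[1:]
        before, after = [], []
        for x in rest:
            comparisons += 1
            (before if x < pivot else after).append(x)
        # Push in reverse so the piles pop as: before, pivot, after.
        stack.append(("pile", after))
        stack.append(("done", pivot))
        stack.append(("pile", before))
    return result, comparisons
```

With the seven items 4, 2, 6, 1, 3, 5, 7 every pivot lands in the middle of its pile and the sort needs 10 comparisons. With the same items already in order, every pivot is at one end of its pile and it needs 21, which is N(N-1)/2 and grows quadratically.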

Merge sort is another divide-and-conquer algorithm that works in a different order. You can merge two sorted lists in linear time. Divide the papers into two equal or nearly equal piles, then recursively sort each one, then merge. Merge sort doesn't have any bugs. One reason that quicksort is more popular than merge sort is historical: Quicksort is fast (usually) and it works without any extra memory. But these days, it can be more important to save comparisons than to save memory, and the actual rearrangement is often abstracted by permuting pointers. If things had always been that way, then I suspect that merge sort would simply have been more popular than quicksort. (And maybe adding "quick" to the name was good salesmanship.)
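For comparison, a minimal top-down merge sort in Python, following the description above: split into two nearly equal piles, sort each recursively, then merge in linear time. It returns a new list rather than sorting in place, which is where the extra memory goes.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take from the right pile only when it is strictly smaller,
        # so equal items keep their original order.
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a sorted copy of items."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

The split never depends on the data, so the recursion depth is always about log2(N) and there is no input order that makes it quadratic.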