What's faster: inserting into a priority queue, or sorting retrospectively?_问答_开发者

What's faster: inserting into a priority queue, or sorting retrospectively?

I am generating some items that I need to be sorted at the end. I was wond开发者_StackOverflow中文版ering, what is faster in terms of complexity: inserting them directly in a priority_queue or a similar data structure, or using a sort algorithm at end?

Testing is the best way to answer this question for your specific computer architecture, compiler, and implementation. Beyond that, there are generalizations.

First off, priority queues are not necessarily O(n log n).

If you have integer data, there are priority queues which work in O(1) time. Beucher and Meyer's 1992 publication "The morphological approach to segmentation: the watershed transformation" describes hierarchical queues, which work quite quickly for integer values with limited range. Brown's 1988 publication "Calendar queues: a fast 0 (1) priority queue implementation for the simulation event set problem" offers another solution which deals well with larger ranges of integers - two decades of work following Brown's publication has produced some nice results for doing integer priority queues fast. But the machinery of these queues can become complicated: bucket sorts and radix sorts may still provide O(1) operation. In some cases, you may even be able to quantize floating-point data to take advantage of an O(1) priority queue.

Even in the general case of floating-point data, that O(n log n) is a little misleading. Edelkamp's book "Heuristic Search: Theory and Applications" has the following handy table showing the time complexity for various priority queue algorithms (remember, priority queues are equivalent to sorting and heap management):

What's faster: inserting into a priority queue, or sorting retrospectively?

As you can see, many priority queues have O(log n) costs not just for insertion, but also for extraction, and even queue management! While the coefficient is generally dropped for measuring the time complexity of an algorithm, these costs are still worth knowing.

But all these queues still have time complexities which are comparable. Which is best? A 2010 paper by Cris L. Luengo Hendriks entitled "Revisiting priority queues for image analysis" addresses this question.

What's faster: inserting into a priority queue, or sorting retrospectively?

In Hendriks' hold test, a priority queue was seeded with N random numbers in the range [0,50]. The top-most element of the queue was then dequeued, incremented by a random value in the range [0,2], and then queued. This operation was repeated 10^7 times. The overhead of generating the random numbers was subtracted from the measured times. Ladder queues and hierarchical heaps performed quite well by this test.

The per element time to initialize and empty the queues were also measured---these tests are very relevant to your question.

What's faster: inserting into a priority queue, or sorting retrospectively?

As you can see, the different queues often had very different responses to enqueueing and dequeueing. These figures imply that while there may be priority queue algorithms which are superior for continuous operation, there is no best choice of algorithm for simply filling and then emptying a priority queue (the operation you're doing).

Let's look back at your questions:

What's faster: inserting into a priority queue, or sorting retrospectively?

As shown above, priority queues can be made efficient, but there are still costs for insertion, removal, and management. Insertion into a vector is fast. It's O(1) in amortized time, and there are no management costs, plus the vector is O(n) to be read.

Sorting the vector will cost you O(n log n) assuming that you have floating-point data, but this time complexity's not hiding things like the priority queues were. (You have to be a little careful, though. Quicksort runs very well on some data, but it has a worst-case time complexity of O(n^2). For some implementations, this is a serious security risk.)

I'm afraid I don't have data for the costs of sorting, but I'd say that retroactive sorting captures the essence of what you're trying to do better and is therefore the better choice. Based on the relative complexity of priority queue management versus post-sorting, I'd say that post-sorting should be faster. But again, you should test this.

I am generating some items that I need to be sorted at the end. I was wondering, what is faster in terms of complexity: inserting them directly in a priority-queue or a similar data structure, or using a sort algorithm at end?

We're probably covered this above.

There's another question you didn't ask, though. And perhaps you already know the answer. It's a question of stability. The C++ STL says that the priority queue must maintain a "strict weak" order. This means that elements of equal priority are incomparable and may be placed in any order, as opposed to a "total order" where every element is comparable. (There's a nice description of ordering here.) In sorting, "strict weak" is analogous to an unstable sort and "total order" is analogous to a stable sort.

The upshot is that if elements of the same priority should stay in the same order you pushed them into your data structure, then you need a stable sort or a total order. If you plan to use the C++ STL, then you have only one option. Priority queues use a strict weak ordering, so they're useless here, but the "stable_sort" algorithm in the STL Algorithm library will get the job done.

Let me know if you'd like a copy of any of the papers mentioned or would like clarification. :-)

Inserting n items into a priority queue will have asymptotic complexity O(n log n) so in terms of complexity, it’s not more efficient than using sort once, at the end.

Whether it’s more efficient in practice really depends. You need to test. In fact, in practice, even continued insertion into a linear array (as in insertion sort, without building a heap) may be the most efficient, even though asymptotically it has worse runtime.

Depends on the data, but I generally find InsertSort to be faster.

I had a related question, and I found in the end the bottleneck was just that I was doing a deffered sort (Only when I ended up needed it) and on a large amount of items, I usually had the worst-case-scenario for my QuickSort (already in order), So I used an insert sort

Sorting 1000-2000 elements with many cache misses

So analyze your data!

To your first question (which is faster): it depends. Just test it. Assuming you want the final result in a vector, the alternatives might look something like this:

#include <iostream>
#include <vector>
#include <queue>
#include <cstdlib>
#include <functional>
#include <algorithm>
#include <iterator>

#ifndef NUM
    #define NUM 10
#endif

int main() {
    std::srand(1038749);
    std::vector<int> res;

    #ifdef USE_VECTOR
        for (int i = 0; i < NUM; ++i) {
            res.push_back(std::rand());
        }
        std::sort(res.begin(), res.end(), std::greater<int>());
    #else
        std::priority_queue<int> q;
        for (int i = 0; i < NUM; ++i) {
            q.push(std::rand());
        }
        res.resize(q.size());
        for (int i = 0; i < NUM; ++i) {
            res[i] = q.top();
            q.pop();
        }
    #endif
    #if NUM <= 10
        std::copy(res.begin(), res.end(), std::ostream_iterator<int>(std::cout,"\n"));
    #endif
}

$ g++     sortspeed.cpp   -o sortspeed -DNUM=10000000 && time ./sortspeed

real    0m20.719s
user    0m20.561s
sys     0m0.077s

$ g++     sortspeed.cpp   -o sortspeed -DUSE_VECTOR -DNUM=10000000 && time ./sortspeed

real    0m5.828s
user    0m5.733s
sys     0m0.108s

So, std::sort beats std::priority_queue, in this case. But maybe you have a better or worse std:sort, and maybe you have a better or worse implementation of a heap. Or if not better or worse, just more or less suited to your exact usage, which is different from my invented usage: "create a sorted vector containing the values".

I can say with a lot of confidence that random data won't hit the worst case of std::sort, so in a sense this test might flatter it. But for a good implementation of std::sort, its worst case will be very difficult to construct, and might not actually be all that bad anyway.

Edit: I added use of a multiset, since some people have suggested a tree:

    #elif defined(USE_SET)
        std::multiset<int,std::greater<int> > s;
        for (int i = 0; i < NUM; ++i) {
            s.insert(std::rand());
        }
        res.resize(s.size());
        int j = 0;
        for (std::multiset<int>::iterator i = s.begin(); i != s.end(); ++i, ++j) {
            res[j] = *i;
        }
    #else

$ g++     sortspeed.cpp   -o sortspeed -DUSE_SET -DNUM=10000000 && time ./sortspeed

real    0m26.656s
user    0m26.530s
sys     0m0.062s

To your second question (complexity): they're all O(n log n), ignoring fiddly implementation details like whether memory allocation is O(1) or not (vector::push_back and other forms of insert at the end are amortized O(1)) and assuming that by "sort" you mean a comparison sort. Other kinds of sort can have lower complexity.

As far as I understand, your problem does not require Priority Queue, since your tasks sounds like "Make many insertions, after that sort everything". That's like shooting birds from a laser, not an appropriate tool. Use standard sorting techniques for that.

You would need a Priority Queue, if your task was to imitate a sequence of operations, where each operation can be either "Add an element to the set" or "Remove smallest/greatest element from the set". This can be used in problem of finding a shortest path on the graph, for example. Here you cannot just use standard sorting techniques.

A priority queue is usually implemented as a heap. Sorting using a heap is on average slower than quicksort, except that quicksort has a worse worst case performance. Also heaps are relatively heavy data structures, so there's more overhead.

I'd reccomend sort at end.

Why not use a binary search tree? Then the elements are sorted at all times and the insertion costs are equal to the priority queue. Read about RedBlack balanced trees here

I think that the insertion is more efficient in almost all cases where you are generating the data (i.e. don't already have it in a list).

A priority queue is not your only option for insertion as you go. As mentioned in other answers a binary tree (or related RB-tree) is equally efficient.

I would also check how the priority queue is implemented - many are based on b-trees already but a few implementations are not very good at extracting the elements (they essentially go through the entire queue and look for the highest priority).

On a max-insert priority queue operations are O(lg n)

There are a lot of great answers to this question. A reasonable "rule of thumb" is

If you have all your elements "up front" then choose sorting.
If you will be adding elements / removing minimal elements "on the fly" then use a priority queue (e.g., heap).

For the first case, the best "worst-case" sort is heap sort anyway and you'll often get better cache performance by just focusing on sorting (i.e. instead of interleaving with other operations).