Limited Sort/Filter Algorithm_问答_开发者_运维开发者技术经验分享

I have a rather large list of elements (100s of thousands).

I have a filter that can either accept or not accept elements.

I want the top 100 elements that satisfy the filter.

So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.

But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.

Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elem开发者_运维问答ents?

My instinct is to select the top 100 elements from the list (much cheaper than a sort, use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100-n failures. If n < 100 then repeat by selecting 100-n elements from the top of the remainder of the list:

k = 100
while (k > 0):
    select top k from list and remove them
    filter them, yielding n successes
    k = k - n

All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.

I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.

It also has the problem that it will likely do a largeish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step. Say, k divided by the expected success rate of the filter, plus a small constant. The expectation based on past performance if there's no domain knowledge you can use to predict it, and the small constant chosen experimentally to avoid an annoyingly large number of steps to find the last few elements. If you end up at any step with more items that have passed the filter than the number you're still looking for (i.e, n > k), then select the top k from the current batch of successes and you're done.

Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.

I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See http://pub.uni-bielefeld.de/publication/2305936 (Figure 4.4 et al) for details.

If I understand right, you have two choiced:

Selecting 100 Elements - N operations of the filter check. Then 100(lg 100) for the sort.
Sorting then selecting 100 Elements - At least N(lg N) for the sort, then the select.

the first sounds shorter then sorting then selecting.

I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.

Steve's suggestion to use Quicksort is a good one.

1 Read in the first 1000 or so elements.
2 Sort them and pick the 100th largest element.
3 Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4 Select the upper half of the result of the Quicksort pass for further processing.

You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.