开发者

How do I find the median of numbers in linear time using heaps?

开发者 https://www.devze.com 2022-12-26 06:29 出处:网络
Wikipedia says: Selection algorithms开发者_运维百科: Finding the min, max, both the min and max, median, or

Wikipedia says:

Selection algorithms开发者_运维百科: Finding the min, max, both the min and max, median, or even the k-th largest element can be done in linear time using heaps.

All it says is that it can be done, and not how.

Can you give me some start on how this can be done using heaps?


You would use a min-max-median heap to find the min, max and median in constant time (and take linear time to build the heap). You can use order-statistics trees to find the kth smallest/largest value. Both of these data structures are described in this paper on min-max heaps [PDF]. Min-max heaps are binary heaps that alternate between min-heaps and max-heaps.

From the paper:

A min-max-median heap is a binary tree with the following properties:

  1. The median of all elements is located at the root

  2. The left subtree of the root is a min-max heap Hl of size ceiling[((n-1)/2)] containing elements less than or equal to the median. The right subtree is a max-min heap Hr of size floor[((n-1)/2)] containing only elements greater than or equal to the median.

The paper goes on to explain how to build such a heap.

Upon reading the paper more thoroughly it appears as though building the min-max-median heaps requires that you first find the median (FTA: "Find the median of all n elements using any one of the known linear-time algorithms"). That said, once you have built the heap you can maintain the median simply by maintaining the balance between the min-max heap on the left and the max-min heap on the right. DeleteMedian replaces the root with either the min of the max-min heap or the max of the min-max heap (whichever maintains the balance).

So if you plan on using a min-max-median heap to find the median of a fixed data set you're SOL but if you are using it on a changing data set it is possible.


See this wikipedia page on selection algorithms. In particular, look at the BFPRT algorithm and the Median of Medians algorithm. BFPRT is probabilistically linear, and is modelled on quicksort; Median of Medians is guaranteed linear, but has a large constant factor and so might take longer in practice, depending on the size of your dataset.

If you only have a few hundred or thousand elements from which to select the median, I suspect that a simple quicksort followed by direct indexing is easiest.


There are likely better algorithms out there, but here's how I'd do it:

Have two buckets and a value. The value is the median, the two buckets are "bigger than median" and "smaller than median". For each element x in the array, rebalance the buckets such that big_bucket and small_bucket differ by no more than 1 in their size. When moving items from the big bucket to the small bucket they first must pass through the median value to get there (that is, a difference of 2 will successfully push an element from one bucket to the next - a difference of 1 will push an element from one bucket to the median value.) At the end of your first pass through the array the value should be your median.


Perhaps it wasn't around when the original question was asked, but now wiki has a link to the source, and here it has been: http://ftp.cs.purdue.edu/research/technical_reports/1991/TR%2091-027.pdf

Specifically, go to page 17, and look at the description of RSEL4. They prove in theorem 3.2 that the time complexity of this k-th selection algorithm is O(k). So it would take you O(n) to build the heap, and an extra O(k) to find k-th smallest item.

It's not really as straightforward as some of the other answers have suggested.


Store the first integer in the array and set a counter to 1. Then loop through the remaining integers in the vector. If the current integer in the array is the same as the one stored, the counter is increased by one, otherwise the counter is decreased by one. If the counter ever reaches zero, throw away the stored integer and replace it with the current integer in the array. When you finally have looped through all integers you are left with one candidate. You then need to loop through the array again and count the occurrence of the candidate to verify that this really is a dominator.

static int FindDominator(int[] arr)
{
int counter = 1;
int candidate = arr[0];
for(int i = 1; i < n; i++)
{
   if(arr[i] == candidate) counter++
    else 
   {
        counter--;
        if(counter == 0) { candidate = arr[i]; counter = 1; }
    }
}
counter = 0;
for(int i = 0;  i < n; i++)
{
    if(arr[i] == candidate) counter++;
}
if(counter > n / 2) return candidate;
else return -1;
}


if you know more about heap data structure, you will easily understand that that s actually the case. heap structure can be built in O(n) time, there is min heap and max heap. min heap root element will give you the smallest element. max heap root element will give you the max element. Just by building the heap you find the min and max. same idea for median and kth largest, while building your heap, you can find the median and kth largest by looking at left or right branch of the tree and by keeping a constant amount of memory to store the element number. etc.


Obviously, min and max in O(n) is easy and don't require a heap.

K'th largest can be done fairly simply by maintaining a k-sized heap of the top k values so far. Runtime would be O(n*logk). You could call that linear time if k is fixed size and k << n.

I don't think median is possible though. Just creating an O(n) sized heap requires O(n*logn) time.

Edit: Ok, after thinking about this a bit more, IVlad is right. You can create a heap in O(n), for a fixed size. But... that doesn't help the OP with his Median question. The linear heap creation technique only produces a valid heap as its final output. The simple approach of doing n inserts, resulting in a valid heap after each step is O(n*logn).

It seems to me that using heaps to find the median would require the use of those running sub-heaps. For instance, there was an answer posted here (that seems to be deleted now) that linked to a blog post suggesting an algorithm for this problem. It tracked the running median using two heaps (the smaller half and the larger half) as it does a single pass through the data. That would require the slower, naive heap approach becuase it depends on maintaining valid heaps as it inserts and removes from them.

Is there some other way to find the median using the linear one-shot heap creation technique?

0

精彩评论

暂无评论...
验证码 换一张
取 消