I need to read sorted array from input to awk/gawk and get median. I don't want to store the whole array and am trying to get cons开发者_JAVA技巧tant space for the calculation.
Are you aware of any algorithm doing this? Given the array is sorted but its size is unknown.
Thank you in advance!
There is no algorithm to exactly find the median of a sorted sequence of unknown length that runs with a fixed amount of memory.
To see this, consider such an algorithm. Say it has a buffer of length N
for holding items from the sequence. Until this buffer is full, the algorithm simply puts items in it, tracking the median while it does so.
When the algorithm scans the N+1
th item and beyond, it must choose one item to drop at each step. Suppose it has already scanned 2N
items, dropping half of them. Let's give it the benefit of the doubt, and say it has not yet dropped median of the input stream.
Consider when it is scans the 2N+1
th item. Which item should it drop? It can't drop the least element it has kept so far, since input may be exhausted after this item, in which case the lowest may be the median. Likewise for any possible element it may drop, there is a future to the input sequence that makes this dropped element the median.
If you are willing to take approximate results, then this estimator may work for you.
Take two passes, using the first just to work out the size of the array, and storing the data in a file, if necessary. Otherwise you can't do it without storing the array, because if you take the state of the program after reading n items, then by feeding it enough very large numbers you can retrieve any of the last n/2 items as the median, so the program must in fact be remembering at least those items.
Basically what you're asking for is an "algorithm" to find the size N
of the array, because the median will be element number (N+1)/2
(ignoring even/odd details for now).
I can't think of an algorithm that doesn't involve two passes. By definition, you need a first pass to figure out N
.
While scanning element i+1
, you could keep a buffer of the previous i/2
elements. When you reach the end of the array, the median would just be the first value in the buffer, i.e requiring only one pass. The problem with this is that you would have to allocate enough memory for the buffer to contain N/2
elements -- but you don't know what N
is, so you don't know how large the buffer should be! Also if N
values is too large to store, as you state in the question, then presumably N/2
values is also too large to store (otherwise my advice would be: just double your RAM).
So this buffer approach isn't an option. Two passes it is. One to figure out N
, one to get element (N+1)/2
.
精彩评论