Pos开发者_StackOverflow社区sible Duplicate:
Algorithm to find k smallest numbers in array of n items
How do you find the first 20 smallest elements in a very big array ?
You have two options
- Sort the array and pull the 20 elements on the small end (depends on which direction you sort the array, right?)
- Keep a sorted set (may not be a set due to nonuniqueness of the array) of elements of the array. Add the first 20 elements in the array. Each time you find one smaller than the highest element in the 'good set', replace the highest element with this new element.
The second one may seem slower, but it really depends on the size of the array. You could do it with one pass through the array, so it might be better to do this on an array of eight billion or something.
Edit: the first algorithm is O(n lg n)
. The second algorithm is O(k n)
, where k in this case is 20 (you want the first 20). So the second algorithm is faster when lg n > 20
or n > 2^20
or n > ~1 million
. So if you have less than a million you're better off sorting. If you have more than a million you're better off making the external list and going through with one pass.
If the array is realy big, sorting it would take a long time and a lot of space.
What you need:
Copy the first 20 elements of the array A into a new array B.
Sort B
Walk over array A and for each element check if it is smaller than B[19]
if yes => add it to B, sort B, delete the last element of B
Not sure if it will be optimal but you can try to run 20 iteration of inserition sort.
For God's sake, don't sort the whole array. Have an array of size 20 initialized to the first 20 elements of the big array. Now, step through the big array, replacing any element in the small array bigger than the one from the big array you are currently considering. This is O(n); better than any comparison based sort will ever do, and possibly more efficient (with a good implementation) than linear sorts (which can't always be used, anyway).
EDIT:
So, out of curiosity, I implemented the naive version of the linear algorithm and compared it to the C++ STL sort() function. Here are my results - they show that, as I expected, the linear algorithm is, on average, always better than sorting - even if, in the theoretical worst case for the linear algorithm, you would need a larger array for it to win. Here are my performance figures:
N Sort Linear Common
32, 378, 170, 116
64, 831, 447, 237
128, 1741, 1092, 424
256, 5260, 2211, 865
512, 10955, 5944, 1727
1024, 20451, 10529, 3584
2048, 38459, 21723, 7011
4096, 77697, 41023, 14136
8192, 150630, 82919, 28083
16384, 311593, 166740, 55978
32768, 648331, 334612, 111891
65536, 1329827, 673030, 224665
131072, 2802540, 1342430, 449553
262144, 5867379, 2717356, 896673
524288, 12082264, 5423038, 1798905
1048576, 25155593, 10941005, 3658716
2097152, 62429382, 24501189, 8940410
4194304, 120370652, 44820562, 14843411
N is the problem size, Sort is the sort time in microseconds, Linear is the linear algorithm time in microseconds, and Common is the time spent randomizing the array before each of the tests. Note that to get just the time spent in the Sort and Linear algorithms, you would need to subtract from the values in columns two and three the values in column four. If you would like me to do this, I would be happy. Still, it's clear that linear is faster than sorting. Each N was tested 100 times, and these are aggregate figures (summed times) from all 100 tests. Here is the code I used:
void randomize(unsigned char *data, int n) {
for(int i = 0; i < n; i++)
data[i] = (unsigned char)(rand() % 256);
}
void sorttest(unsigned char *data, int n) {
unsigned char results[20];
sort(data, data + n);
for(int i = 0; i < 20; i++)
results[i] = data[i];
}
void scantest(unsigned char *data, int n) {
unsigned char results[20];
for(int i = 0; i < 20; i++)
results[i] = data[i];
for(int i = 20; i < n; i++)
for(int j = 0; j < 20; j++)
if(data[i] < results[j]) {
results[j] = data[i];
break;
}
}
void dotest(int n)
{
unsigned char *data = (unsigned char*)malloc(n);
timeval t1, t2, t3, t4, t5, t6;
gettimeofday(&t1, 0);
for(int i = 0; i < 100; i++) {
randomize(data, n);
sorttest(data, n);
}
gettimeofday(&t2, 0);
gettimeofday(&t3, 0);
for(int i = 0; i < 100; i++) {
randomize(data, n);
scantest(data, n);
}
gettimeofday(&t4, 0);
gettimeofday(&t5, 0);
for(int i = 0; i < 100; i++)
randomize(data, n);
gettimeofday(&t6, 0);
int dt1 = 1000000*(t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec);
int dt2 = 1000000*(t4.tv_sec - t3.tv_sec) + (t4.tv_usec - t3.tv_usec);
int dt3 = 1000000*(t6.tv_sec - t5.tv_sec) + (t6.tv_usec - t5.tv_usec);
printf("%10d, %10d, %10d, %10d\n", n, dt1, dt2, dt3);
free(data);
}
int main() {
srand(time(0));
for(int i = 32; i < 5000000; i*=2) dotest(i);
return 0;
}
I invite anybody who is claiming that sorting is just as good to point out how I can modify this benchmark to be more fair/correct so that sorting comes out on top. No really; feel free to experiment with it yourselves.
精彩评论