Fastest way to sort files

Developer https://www.devze.com 2022-12-22 21:49 Source: Internet

Related topics: bash, sorting

I have a huge text file with lines like:

-568.563626  159   33  -1109.660591  -1231.295129  4.381508
-541.181308  159   28  -1019.279615  -1059.115975  4.632301
-535.370812  155   29  -1033.071786  -1152.907805  4.420473
-533.547101  157   28  -1046.218277  -1063.389677  4.423696

What I want is to sort the file by the 5th column, so I would get:

-568.563626  159   33  -1109.660591  -1231.295129  4.381508
-535.370812  155   29  -1033.071786  -1152.907805  4.420473
-533.547101  157   28  -1046.218277  -1063.389677  4.423696
-541.181308  159   28  -1019.279615  -1059.115975  4.632301

For this I use:

for i in file.txt ; do sort -k5n $i ; done

I wonder if this is the fastest or most efficient way.

Thanks

Why use for? Why not just:

sort -k5n file.txt
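One small refinement, offered as a sketch rather than a correction: -k5n starts the key at field 5 and lets it run to the end of the line, while -k5,5n stops the key at field 5. For numeric keys both should order this data the same way, but the second form states the intent more clearly. The example below uses the four lines from the question in a temporary file; the LC_ALL=C prefix is my own assumption, added only so the decimal point is read the same way on every system.

```shell
# Sample data taken from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
-568.563626  159   33  -1109.660591  -1231.295129  4.381508
-541.181308  159   28  -1019.279615  -1059.115975  4.632301
-535.370812  155   29  -1033.071786  -1152.907805  4.420473
-533.547101  157   28  -1046.218277  -1063.389677  4.423696
EOF

# Sort numerically on field 5 only (the key starts and ends at field 5).
# LC_ALL=C is an assumption made here for a predictable decimal point.
sorted=$(LC_ALL=C sort -k5,5n "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

The printed order should match the desired output shown in the question.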

Which sort is most efficient depends on a number of factors. You could no doubt write a faster sort for specific data sets (depending on size and other properties); bubble sort can actually outperform other sorts on particular inputs, such as data that is already almost in order.

However, have you tested the standard sort and established that it's too slow? That's the first thing you should do. My machine (which is by no means the gruntiest on the planet) can do 4 million of those lines in under ten seconds:

real     0m9.023s
user     0m8.689s
sys      0m0.332s
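If you want to reproduce a measurement like this yourself, something along these lines should do. It is only a sketch: the data is randomly generated to mimic the six-column layout from the question (it is not the asker's file), the value ranges are invented, and the default line count is kept small so it finishes quickly. Set N=4000000 to match the figure quoted above.

```shell
# Number of lines to generate; N is a hypothetical knob for this sketch.
n=${N:-100000}
data=$(mktemp)

# Write n synthetic lines in the same six-column layout as the question.
# The value ranges are made up and only roughly resemble the real data.
awk -v n="$n" 'BEGIN {
    srand(1)
    for (i = 0; i < n; i++)
        printf "%.6f  %d   %d  %.6f  %.6f  %.6f\n",
            -600 * rand(), 150 + int(10 * rand()), 25 + int(10 * rand()),
            -1200 * rand(), -1300 * rand(), 4 + rand()
}' > "$data"

# Time the standard sort on column 5; the sorted output is discarded.
time sort -k5n "$data" > /dev/null

# Remove "$data" when you are done with it.
```

The real/user/sys figures will of course differ from machine to machine and with the size of the input.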

Having said that, there is at least one trick which may speed it up. Transform the file into fixed-length records with fixed-length fields before sorting it. Sorting fixed-length records on a specific range of character positions can often be much faster than the more flexible sorting on variable-length fields and records that sort also allows.

That way, you add an O(n) operation (the transformation) to speed up what is probably at best an O(n log n) operation (the sort).
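A rough sketch of that idea, using awk for the transformation. The field widths are my own choice, picked to fit the sample lines from the question, and whether this actually beats plain sort -k5n on your data is something only a measurement can tell you. With these widths, column 5 always sits in characters 36 to 48 of each line.

```shell
# Sample data taken from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
-568.563626  159   33  -1109.660591  -1231.295129  4.381508
-541.181308  159   28  -1019.279615  -1059.115975  4.632301
-535.370812  155   29  -1033.071786  -1152.907805  4.420473
-533.547101  157   28  -1046.218277  -1063.389677  4.423696
EOF

# Step 1 (the O(n) pass): pad every field to a fixed width.
# Widths 12, 3, 3, 13, 13 and 9 are assumptions sized for this sample.
fixed=$(awk '{ printf "%12.6f %3d %3d %13.6f %13.6f %9.6f\n", $1, $2, $3, $4, $5, $6 }' "$tmp")

# Step 2: sort numerically on characters 36-48 only. The separator "|" never
# occurs in the data, so the whole line counts as field 1.
sorted_fixed=$(printf '%s\n' "$fixed" | LC_ALL=C sort -t '|' -k1.36,1.48n)
printf '%s\n' "$sorted_fixed"

rm -f "$tmp"
```

The key still has to be compared numerically (-n) because the values are signed; a plain character comparison would put the negative numbers in the wrong order.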

But, as with all optimisations, measure, don't guess!

If you have many different files to sort, you may use a loop. However, since you have only one file, just pass the filename to sort.
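For the many-files case, a loop could look roughly like this. The directory, the file names and the .sorted suffix are all made up for the example, and LC_ALL=C is again only there for a predictable decimal point.

```shell
# Hypothetical setup: a scratch directory with two small input files.
workdir=$(mktemp -d)
printf '%s\n' 'a 1 1 1 3.5 x' 'b 1 1 1 -2.0 x' > "$workdir/run1.txt"
printf '%s\n' 'c 1 1 1 10 x'  'd 1 1 1 9 x'    > "$workdir/run2.txt"

# Sort each file on column 5 and write the result next to the input.
for f in "$workdir"/*.txt; do
    [ -e "$f" ] || continue        # skip the literal pattern if nothing matches
    LC_ALL=C sort -k5n "$f" > "${f%.txt}.sorted"
done
```

Note the quotes around "$f"; without them, file names containing spaces would be split into several arguments.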