
Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me: how would you go about sorting terabytes of data using a programming language?

I'm a C++ guy, and of course we spoke about merge sort. One possible technique is to split the data into smaller chunks, sort each of them, and finally merge them.
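A minimal Java sketch of that split-sort-merge idea, assuming line-oriented text and an arbitrary chunk size, might look like this: spill memory-sized sorted runs to temp files, then k-way merge them with a priority queue.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalSort {
        static final int CHUNK_LINES = 1_000_000; // lines per in-memory run (tuning assumption)

        public static void sort(Path input, Path output) throws IOException {
            List<Path> runs = new ArrayList<>();

            // Phase 1: read a memory-sized chunk, sort it, spill it to a temp file.
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> chunk = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == CHUNK_LINES) { runs.add(spill(chunk)); chunk.clear(); }
                }
                if (!chunk.isEmpty()) runs.add(spill(chunk));
            }

            // Phase 2: k-way merge of the sorted runs, always emitting the smallest head line.
            PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing((Run r) -> r.head));
            for (Path p : runs) {
                Run r = new Run(Files.newBufferedReader(p));
                if (r.head != null) heap.add(r);
            }
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                while (!heap.isEmpty()) {
                    Run r = heap.poll();
                    out.write(r.head);
                    out.newLine();
                    r.head = r.reader.readLine();
                    if (r.head != null) heap.add(r); else r.reader.close();
                }
            }
        }

        private static Path spill(List<String> chunk) throws IOException {
            Collections.sort(chunk); // plain in-memory sort of one chunk
            Path run = Files.createTempFile("sort-run", ".txt");
            Files.write(run, chunk);
            return run;
        }

        private static class Run {
            final BufferedReader reader;
            String head; // current smallest line still unmerged from this run
            Run(BufferedReader reader) throws IOException {
                this.reader = reader;
                this.head = reader.readLine();
            }
        }
    }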

But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?

In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?


EDIT: relet makes a very good point; you only need to keep the indexes and have those sorted. You can easily and efficiently retrieve sorted data that way, without sorting the entire dataset.


Yes. Last time I checked, Google processed over 20 petabytes of data daily.


Consider log data from servers: Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus if you want it sorted by product, you would need to sort the whole dataset.

Another issue is that the data often needs to be sorted according to a processing requirement that might not be known beforehand.

For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation that I used was by Prof. Daniel Lemire:

http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/

The data was sorted by user ID, and each line contained a user ID followed by the user ID of the person following them. However, in my case I wanted data about who follows whom, so I had to sort it again by the second user ID on each line.
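The second pass only needs a different comparison function. A tiny sketch, assuming space-separated "follower followee" pairs (the layout is my assumption about the file format):

    import java.util.*;

    public class BySecondId {
        // Orders "followerId followeeId" lines by the second column, so all
        // followers of the same user end up adjacent after sorting.
        static final Comparator<String> BY_SECOND_ID =
                Comparator.comparingLong(line -> Long.parseLong(line.split(" ")[1]));

        public static void main(String[] args) {
            List<String> edges = new ArrayList<>(List.of("3 7", "1 2", "5 2"));
            edges.sort(BY_SECOND_ID);
            System.out.println(edges); // [1 2, 5 2, 3 7]
        }
    }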

However, for sorting 1 TB I would use MapReduce with Hadoop. Sorting is the default step after the map function. Thus I would choose the identity function as the map function and NONE as the reduce function, and set up a streaming job.

Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function runs, the map output is sorted, I guess by an algorithm similar to merge sort.

Here is the link to the identity mapper: http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

If you want to sort by some element in the data, then I would make that element the key in the map output and the whole line the value.
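A minimal sketch of such a mapper against the old mapred API linked above; the class name FieldKeyMapper, the tab-separated layout, and the column index are assumptions, not part of the original answer. Paired with an identity reducer, the shuffle hands back the records sorted by that key (per reducer partition):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits the chosen field as the key and the whole line as the value,
    // so Hadoop's shuffle phase sorts the records by that field for us.
    public class FieldKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private static final int SORT_FIELD = 1; // tab-separated column to sort by (assumption)

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split("\t");
            // Key = the field we want sorted; value = the original record.
            output.collect(new Text(fields[SORT_FIELD]), line);
        }
    }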


Yes, certain companies certainly sort at least that much data every day.

Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.

Hadoop is a similar Apache project you can play with yourself, which lets you split a sort algorithm over a cluster of computers.


Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
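As a toy illustration of that point, assuming a hypothetical data file addressed by byte offsets, the index can be as small as a sorted map from key to offset; the records themselves are never moved:

    import java.util.TreeMap;

    public class TinyIndex {
        public static void main(String[] args) {
            // Keys are sorted in the index; offsets point into the unsorted data file.
            TreeMap<String, Long> index = new TreeMap<>();
            index.put("zebra", 0L);    // record at byte 0 of the data file
            index.put("apple", 512L);  // record at byte 512
            index.put("mango", 1024L);

            // Iterating the index visits keys in sorted order without
            // ever reordering the underlying records.
            index.forEach((key, offset) ->
                    System.out.println(key + " -> offset " + offset));
            // apple -> offset 512
            // mango -> offset 1024
            // zebra -> offset 0
        }
    }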


Yes. Some companies do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years: every change in the price offering, real deal prices (trades, also known as prints), and so on. For highly volatile instruments, such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments going back a couple of years. Not to mention news that they correlate with the market, weather conditions and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.


Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.


Big companies do sort terabytes and petabytes of data regularly; I've worked for more than one such company. Like Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enrichment, etc.) at massive scale.

Despite all that, there might be situations when you will need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps. Per security/privacy policies, certain fields in the log files needed to be encrypted before the data could be moved on for further processing. That meant that for each row, a custom encryption algorithm was applied. However, since the rate of repetition was high (the same field value appears hundreds of times in the file), it was more efficient to sort the file first, encrypt each distinct value once, and reuse the cached result for repeated values.
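A rough sketch of that optimization, assuming tab-separated rows and a stand-in encryptField for the custom encryption routine: once the rows are sorted, equal values sit next to each other, so only the previous result needs to be remembered.

    import java.util.*;

    public class SortThenEncrypt {
        // Encrypts one tab-separated column of each row; sorting first makes
        // repeated values adjacent, so each distinct value is encrypted once.
        public static List<String> encryptColumn(List<String> rows, int col) {
            rows.sort(Comparator.comparing((String r) -> r.split("\t")[col]));

            List<String> out = new ArrayList<>(rows.size());
            String lastPlain = null, lastCipher = null;
            for (String row : rows) {
                String[] fields = row.split("\t");
                if (!fields[col].equals(lastPlain)) {   // new value: pay for encryption once
                    lastPlain = fields[col];
                    lastCipher = encryptField(lastPlain);
                }
                fields[col] = lastCipher;               // repeated value: reuse the result
                out.add(String.join("\t", fields));
            }
            return out;
        }

        // Stand-in for the custom (expensive) encryption routine in the answer.
        private static String encryptField(String plaintext) {
            return "enc(" + plaintext + ")";
        }
    }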
