I will try to describe my challenge and operation: I need to calculate stock price indices over a historical period. For example, I will take 100 stocks and calculate their aggregated average price each second (or even less) for the last year. I need to create many different indices like this, where the stocks are picked dynamically out of ~30,000 different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS is too slow, so I am looking for a sophisticated and original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach: distribute all stocks into some kind of key-value pairs of time:price, with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
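A rough sketch of what I mean (in Python, with the storage hand-waved away as plain dicts; a real system would keep the time:price pairs in some key-value or column store):

    def average_index(series_by_stock, picked_stocks, timestamps):
        # series_by_stock: {stock_id: {timestamp: price}} with matching time rows.
        # For each timestamp, average the prices of the picked stocks.
        index = []
        for t in timestamps:
            prices = [series_by_stock[s][t] for s in picked_stocks]
            index.append((t, sum(prices) / len(prices)))
        return index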
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion of a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data and, again, fast lookup (I don't care about insert speed, since the data is loaded once and never changes).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to achieve this within minutes or so.
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
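A minimal sketch of that lookup, assuming a flat binary layout of fixed-width (timestamp, price) records and a coarse in-memory checkpoint index (all names and field sizes here are illustrative, not a prescribed format):

    import bisect
    import struct

    SAMPLE = struct.Struct("<qd")  # 8-byte epoch-second timestamp + 8-byte float price

    def read_period(path, index_times, index_offsets, start_ts, end_ts):
        # index_times/index_offsets: sorted checkpoint timestamps and the byte
        # offset of each checkpoint within the file (e.g. one entry per day).
        # Seek to the nearest checkpoint before start_ts, then read sequentially.
        pos = index_offsets[max(0, bisect.bisect_right(index_times, start_ts) - 1)]
        out = []
        with open(path, "rb") as f:
            f.seek(pos)
            while True:
                buf = f.read(SAMPLE.size)
                if len(buf) < SAMPLE.size:
                    break
                ts, price = SAMPLE.unpack(buf)
                if ts >= end_ts:
                    break
                if ts >= start_ts:
                    out.append((ts, price))
        return out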
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
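Building on the read_period sketch above, the aggregation itself could look roughly like this (the index structure passed in is again hypothetical):

    def aggregate_average(paths, indices, start_ts, end_ts):
        # indices: {path: (index_times, index_offsets)}, i.e. the per-file index
        # from the previous sketch. Each file is read in one sequential pass,
        # one file after the other, summing prices per timestamp.
        sums = {}
        for path in paths:
            times, offsets = indices[path]
            for ts, price in read_period(path, times, offsets, start_ts, end_ts):
                sums[ts] = sums.get(ts, 0.0) + price
        n = len(paths)
        return sorted((ts, total / n) for ts, total in sums.items())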
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
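The condensation step is straightforward; for example:

    def condense(samples):
        # samples: iterable of (timestamp, price) ordered by time; keep a sample
        # only when the price differs from the previous one.
        condensed, last_price = [], None
        for ts, price in samples:
            if price != last_price:
                condensed.append((ts, price))
                last_price = price
        return condensed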
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the meta data for each time series.
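As one possible concrete form of that run layout (field widths and names are my own assumptions, not a fixed format), writing a run and recording its offset for the index could look like:

    import struct

    RUN_HEADER = struct.Struct("<qi")  # first timestamp of the run + number of prices
    PRICE = struct.Struct("<d")

    def append_run(f, first_ts, prices):
        # Append one run (header followed by the prices of consecutive seconds)
        # to an open binary file and return the byte offset where the run starts,
        # so it can be stored in the index (e.g. a MySQL row: series_id, first_ts, offset).
        offset = f.tell()
        f.write(RUN_HEADER.pack(first_ts, len(prices)))
        for price in prices:
            f.write(PRICE.pack(price))
        return offset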
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the I/O throughput of the disks.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
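For example, with Log Parser 2.2's command-line syntax the aggregation could look roughly like this (the file and column names are made up; adjust the query to your CSV layout):

    LogParser -i:CSV -o:CSV "SELECT time, AVG(price) AS avg_price FROM prices_*.csv WHERE symbol IN ('AAA'; 'BBB') GROUP BY time ORDER BY time"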
More info about LogParser:
- Wikipedia
- Coding Horror
What you really need is a relational database that has built-in time series functionality. IBM released one very recently: Informix 11.7 (note that it must be 11.7 to get this feature). Even better news is that for what you are doing, the free version, Informix Innovator-C, will be more than adequate. http://www.freeinformix.com/time-series-presentation-technical.html