
Collecting, storing, and retrieving large amounts of numeric data


I am about to start collecting large amounts of numeric data in real-time (for those interested, the bid/ask/last or 'tape' for various stocks and futures). The data will later be retrieved for analysis and simulation. That's not hard at all, but I would like to do it efficiently and that brings up a lot of questions. I don't need the best solution (and there are probably many 'bests' depending on the metric, anyway). I would just like a solution that a computer scientist would approve of. (Or not laugh at?)

(1) Optimize for disk space, I/O speed, or memory?

For simulation, the overall speed is important. We want the I/O (really, input) speed of the data to be just faster than the computational engine, so we are not I/O limited.

(2) Store text, or something else (binary numeric)?

(3) Given a set of choices from (1)-(2), are there any standout language/library combinations to do the job: Java, Python, C++, or something else?

I would classify this code as "write and forget", so more points for efficiency over clarity/compactness of code. I would very, very much like to stick with Python for the simulation code (because the sims do change a lot and need to be clear). So bonus points for good Pythonic solutions.

Edit: this is for a Linux system (Ubuntu)

Thanks


  1. Optimizing for disk space and IO speed is the same thing - these days, CPUs are so fast compared to IO that it's often overall faster to compress data before storing it (you may actually want to do that). I don't really see memory playing a big role (though you should probably use a reasonably-sized buffer to ensure you're doing sequential writes).

  2. Binary is more compact (and thus faster). Given the amount of data, I doubt human readability has much value. The only advantage of a text format is that it's easier to inspect and repair if it gets corrupted or you lose the parsing code. (A minimal sketch combining compression with a binary record format follows this list.)
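
For instance, a minimal Python sketch of that combination: pack each tick into a fixed-size binary record and append it through gzip, so the file stays compact and the writes stay sequential. The field layout (timestamp, bid, ask, last) and the file name are assumptions for illustration only.

```python
import gzip
import struct
import time

# Illustrative fixed-size record: timestamp, bid, ask, last (all doubles).
RECORD = struct.Struct("<dddd")

# Append one tick as a compressed, sequential binary write.
with gzip.open("ticks.bin.gz", "ab") as f:
    f.write(RECORD.pack(time.time(), 2180.25, 2180.50, 2180.25))
```

Reading it back is the mirror image: `gzip.open("ticks.bin.gz", "rb")` and `RECORD.iter_unpack()` over the decompressed bytes.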


FAME is an often-used commercial solution for time-series storage.

If you are serious about this, building your own will be a big job. HDF5 might be useful; its developers claim it is suitable for tick-data handling, and it has a C++ API. There is Python support as well.
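
For what that Python support could look like, here is a minimal sketch using the h5py bindings with a chunked, compressed, appendable dataset. The record layout (time, price, size, kind) and the file name are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import h5py

# Illustrative compound record type for one tick.
tick_dtype = np.dtype([("time", "f8"), ("price", "f8"),
                       ("size", "i4"), ("kind", "S1")])

with h5py.File("ticks.h5", "w") as f:
    # Unlimited-length, chunked, gzip-compressed dataset we can append to.
    dset = f.create_dataset("ticks", shape=(0,), maxshape=(None,),
                            dtype=tick_dtype, chunks=(65536,),
                            compression="gzip")
    batch = np.array([(1674456789.0, 2180.25, 2, b"B")], dtype=tick_dtype)
    dset.resize(dset.shape[0] + len(batch), axis=0)
    dset[-len(batch):] = batch   # append the batch at the end
```

PyTables exposes a similar appendable-table interface on top of HDF5, if that style suits the simulation code better.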

Useful real-life experience from somebody with the same problem here, including HDF5 refs.


Actually, this is quite similar to what I'm doing, which is monitoring changes players make to the world in a game. I'm currently using an SQLite database with Python. At the start of the program, I load the disk database into memory for fast writes. Each change is put into two lists: one for the in-memory database and one for the disk database. Every x or so updates, the in-memory database is updated and a counter is incremented. When the counter reaches 5, it is reset, the list of changes for the disk is flushed to the disk database, and the list is cleared.

I have found this works well if I also set the journal mode to WAL (Write-Ahead Logging). This method can handle about 100-300 updates a second if I update memory every 100 updates and flush to disk every 5 memory updates.

You should probably choose binary since, unless you have faults in your data sources, it would be the most logical choice.
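
A rough sketch of that buffering pattern with Python's sqlite3 module (the table layout, batch size, and names are illustrative, not the code described above): enable WAL, buffer ticks in a list, and flush each batch as a single transaction instead of one disk write per update.

```python
import sqlite3

conn = sqlite3.connect("ticks.db")
conn.execute("PRAGMA journal_mode=WAL")   # write-ahead logging
conn.execute("CREATE TABLE IF NOT EXISTS ticks (time REAL, kind TEXT, price REAL)")

pending = []          # in-memory buffer of not-yet-flushed ticks

def record(time_, kind, price, flush_every=100):
    pending.append((time_, kind, price))
    if len(pending) >= flush_every:
        with conn:    # commits the whole batch as one transaction
            conn.executemany("INSERT INTO ticks VALUES (?, ?, ?)", pending)
        pending.clear()
```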


Using the D-Bus format to send the information may be to your advantage. The format is a standard, it is binary, D-Bus is implemented in multiple languages, and it can be used both over the network and for inter-process communication on the same machine.


If you are just storing, then use system tools. Don't write your own. If you need to do some real-time processing of the data before it is stored, then that's something completely different.


It just occurred to me, after reading this thread on storing integers efficiently given certain conditions, that we are wasting a lot of bits when we store tick data as doubles or floats or whatever. THE PRICES ARE QUANTIZED! And quite severely, at that. For example, yesterday's NQ range was from about 2175 to 2191 (about 16 points), quantized by 0.25, which limits the ticks to roughly 65 distinct prices. See where I'm going with this? You only need one byte for each price. Stocks are quantized by 0.01, so you'd need roughly one byte for each dollar in the daily range.

So the method I'm outlining is:

  1. Store the high price, low price, and increment as a one-line header.

  2. Store the tick data after that as two bytes each, with the two left-most bits encoding the tick type (00 = last, 01 = bid, 11 = ask).

I think this is something a CS would approve of!
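
As a rough illustration of that layout, assuming doubles for the header and a 14-bit price index in each two-byte tick word; the function and field names here are mine, not part of the answer:

```python
import struct

# Two-bit tick types from the scheme above.
TICK_TYPES = {"last": 0b00, "bid": 0b01, "ask": 0b11}

def encode_ticks(high, low, increment, ticks):
    """Pack a (high, low, increment) header plus (type, price) ticks."""
    out = [struct.pack("<ddd", high, low, increment)]       # header
    for kind, price in ticks:
        index = round((price - low) / increment)            # quantized price
        assert 0 <= index < (1 << 14), "price outside header range"
        out.append(struct.pack("<H", (TICK_TYPES[kind] << 14) | index))
    return b"".join(out)

# e.g. a few NQ ticks quantized by 0.25
blob = encode_ticks(2191.0, 2175.0, 0.25,
                    [("bid", 2180.25), ("ask", 2180.50), ("last", 2180.25)])
```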
