I am looking for good resources on how to query large volume of data efficiently.
Each data item is represented as many different attributes such as quantity, price, history info, etc. The client will provide different query criteria but without requirement to change the dataset. By simply storing all data into MS SQL is not a good method b/c the scalability of MS SQL is not that good. Here we are targeting many tera byte data and need 200-300 CPU clusters.
I am interested in good resources or books that I can at least do some research.
Did you consider NoSql solution as MongoDb ?
If query speed is not your number one issue you should see if you could build a solution with ROOT, possibly in conjunction with PROOF. In contrast to a NoSql solution you would here trade consistency for some speed.
It is used by the CERN experiments to store and retrieve their experimental data (much more than you require) and if you can find a way to handle the I/O it can be made to scale pretty well.
I have heard it is used by some firms doing quantitative finance.
精彩评论