I’m trying to solve a problem in which we analyze a substantial amount of data from a table. We need to pull in certain subsets of this data and analyze each one. As it stands, I believe it would be best to multithread the work: bring in as much data as possible up front and perform the various computations on each region in parallel. Let’s assume the subsets of data to analyze are denoted S1, S2, …, so there will be a thread for each. After the calculations, some visualizations may be created as well, and the results will need to be stored back in the database, since the analysis results may amount to many gigabytes of data. Let’s assume the results are denoted R1, R2, …
Although this is a little vague, I am wondering whether we should create a table for each of R1, R2, etc. or store all of the results in a single table. We will likely want multiple threads storing results at the same time (recall the threads for S1, S2), so if there is a single table, I need to ensure that multiple threads can access it concurrently. If it helps, when the data for R1, R2, etc. is needed again, all of it will be pulled out in a certain order, which would be easy to maintain if there were a table for each of R1, R2, etc. Also, if we go that route, I was thinking we could have a single object per table that manages requests to that particular results table. Essentially, I would like the object to behave like a bean that loads data from the database only as necessary (there is too much to keep in memory at once). Another point: we are using InnoDB as our storage engine, in case that makes any difference as to whether multiple threads can access a particular table.
So, with this bit of information, would it be best to create a single table for all of the results or one table for each region of results (possibly hundreds)?
Thanks
You could, but then you have to manage 100 tables. And getting statistics for the whole set will be that much more difficult.
If the data can easily be partitioned into subsets that do not intersect, the database should not be locking rows that another thread needs, especially if you are just doing reads and processing in your application. In that case you don't need to split the table into hundreds of tables, and each thread in your application can work independently.
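As a rough illustration of the single-table approach, here is a minimal sketch in which one thread per region writes into a shared results table and each region is later read back in order. Everything in it is an assumption for illustration: the `results` table, its `region_id`/`seq`/`value` columns, and the helper names are hypothetical, and SQLite from the Python standard library stands in for MySQL/InnoDB only so that the example runs without a server. SQLite serializes writers, whereas InnoDB uses row-level locking, so the concurrency behaviour on the real system will differ.

```python
import sqlite3
import tempfile
import threading
from pathlib import Path

# Hypothetical schema: one table for every region, keyed by (region_id, seq)
# so that a region's rows can be read back in a stable order.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    region_id INTEGER NOT NULL,
    seq       INTEGER NOT NULL,
    value     REAL    NOT NULL,
    PRIMARY KEY (region_id, seq)
)
"""


def store_region(db_path, region_id, values):
    """Write one region's results using a connection owned by this thread."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait if another writer is busy
    try:
        with conn:  # one transaction per region, committed on success
            conn.executemany(
                "INSERT INTO results (region_id, seq, value) VALUES (?, ?, ?)",
                [(region_id, i, v) for i, v in enumerate(values)],
            )
    finally:
        conn.close()


def load_region(db_path, region_id):
    """Read one region's results back in their original order."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        cur = conn.execute(
            "SELECT value FROM results WHERE region_id = ? ORDER BY seq",
            (region_id,),
        )
        return [row[0] for row in cur]
    finally:
        conn.close()


def run_demo(num_regions=4, rows_per_region=100):
    """Store made-up results for each region from its own thread, then reload."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = str(Path(tmp) / "results.db")
        conn = sqlite3.connect(db_path)
        conn.execute(SCHEMA)
        conn.commit()
        conn.close()

        threads = []
        for region in range(num_regions):
            # Placeholder data standing in for the real analysis output.
            values = [float(region * 1000 + i) for i in range(rows_per_region)]
            t = threading.Thread(target=store_region, args=(db_path, region, values))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()

        return {r: load_region(db_path, r) for r in range(num_regions)}
```

Because every row carries its region identifier, a statistic over the whole set is a single query against one table rather than a union across hundreds of tables.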
This sounds like a good map reduce candidate. That's assuming that you are going to perform the same calculation on the whole set and just want to speed up the process.
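To make the idea concrete, here is a minimal in-process sketch of the map reduce pattern, assuming the calculation is a simple mean over all subsets. The function names and the choice of statistic are hypothetical. A thread pool is used for brevity; for CPU-bound work in CPython a process pool would probably be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_subset(subset):
    """Map step: reduce one subset to a partial (count, total) pair."""
    return (len(subset), sum(subset))


def combine(a, b):
    """Reduce step: merge two partial (count, total) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mean_over_subsets(subsets, max_workers=4):
    """Compute the overall mean by mapping subsets in parallel, then reducing."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_subset, subsets))
    count, total = reduce(combine, partials, (0, 0.0))
    return total / count if count else 0.0
```

For example, `mean_over_subsets([[1, 2, 3], [4, 5], [6]])` maps each subset to its partial pair independently and then merges the pairs. The merge is associative, so the subsets can be processed in any order.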
Have you considered using something like MongoDB? You can write your own map reduce aggregations in it.
Map reduce: http://en.wikipedia.org/wiki/MapReduce
MongoDB: http://www.mongodb.org/display/DOCS/MapReduce
Mongo does support update in place and it's a lockless eventually consistent store.