开发者

Which is faster: Appropriate data input or appropriate data structure?

开发者 https://www.devze.com 2022-12-30 19:58 出处:网络
I have a dataset whose columns look like this: Consumer ID | Product ID | Time Period | Product Score 1| 1| 1| 2

I have a dataset whose columns look like this:

Consumer ID | Product ID | Time Period | Product Score
1           | 1          | 1           | 2
2           | 1          | 2           | 3

and so on.

As part of a program (written in C) I need to process the product scores given by all consumers for a particular product and time period combination for all possible combinations. Suppose that there are 3 products and 2 time periods. Then I need to process the product scores for all possible combinations as shown below:

Product ID | Time Period 
1          | 1
1          | 2
2          | 1
2          | 2
3          | 1
3          | 2

I will need to process the data along the above lines lots of times (> 10k) and the dataset is fairly large (e.g., 48k consumers, 100 products, 24 time periods etc). So speed is an issue.

I came up with two ways to process the data and am wondering which is the faster approach or perhaps it does not matter much? (speed matters but not at the cost of undue maintenance/readability):

  1. Sort the data on pro开发者_Python百科duct id and time period and then loop through the data to extract data for all possible combinations.

  2. Store the consumer ids of all consumers who provided product scores for a particular combination of product id and time period and process the data accordingly.

Any thoughts? Any other way to speed up the processing? Thanks


As with many performance-related questions, the only real, definitive answer is to write a benchmark. Speed will depend on many things, and it doesn't sound like you're talking about a straightforward case of a linear algorithm vs a quadratic algorithm (and even that would have an additional dependency on input size).

So implement both methods, run them on sample data, and time the results. This will be much faster and more conclusive than us trying to work it out in our heads with limited information.


Actually the two methods looks very similar to me. In order to store the costumer id of all customers who provided score for a specific combination you need to sort the data or perform a more expensive operation.

Can you exchange space for time? If yes then do no pre-processing what so ever, but create an array of all combinations (10x24) to store the scores. Process the data as they come and update the score of the specific combination. If you need the average score, store both the sum and the number of customers that provided score.


The slowest part that you have any influence over would probably be copying chunks of memory. So the first technique to apply would be to place the values for each row in a structure and refer to it by pointer only until all you processing is complete. The structures would be something like:

typedef struct {
 int consumer;
 int product;
 int time;
 int score;
} rowData;

Building upon that I would think that you would be best to loop through the input rows and build up a binary tree (or other sorted structure) of structures that are identified by consumer and product, and contain a look up table of pointers to all matching rowData:

typedef struct {
 int consumer;
 int product;
 rowData * matches;
} matchLut;

Once all the rows have been placed in look up tables on the tree then each bundle can be processed.


I would suggest filter the data, like in step two, then processes as per step one. If your performance is unacceptable, tune for performance. Set some benchmarks for your baseline, then try some different approaches.

In most real world situations, I would advise against implementing multiple methods simply for the sake of benchmarking. The performance is likely to be similar. If its not similar, its probably performing poorly and will be in obvious need of tuning. Your time is probably better spent implementing other features.


If memory permits store your data in a 2d array (really 3d, but I'll get to that later). This array will be indexed by (product_id, time_period).

If your processing of the data permits it each element of the 2D array could be an accumulator of the new data so you read in a data element and then adjust the corresponding element of the the 2D array to reflect it. If this method works your data will be processed when you finish reading it in.

If your processing requires you to have data from each data element present at one time then you could make each element of your 2D array a list (this is the 3rd D). It could be a variable length list if you don't know how many customer entries will be present for each (product_id, time_period). After you have read in your data you will need to revisit each element of the 2D array to process each list. How you arrange your array and how you visit the elements will matter for performance. You will probably want to declare this dynamically, but for this example

struct element_t element[NUMBER_OF_PRODUCTS][NUMBER_OF_TIME_PERIODS];
// don't forget to initialize these elements to empty
...
for (p = max_product_id; p >= 0; p--) {
    for (t = max_time_period; t >= 0; t--) {
         process(element[p][t]);
    }
}

Will work better if you want to process each product before moving to the next because. You can swap the declaration's row,column and the loops to achieve better cache hits if you want to process each time period (for all products) before moving to the next.

You should note that this does the sorting for you without saying "sort this data".

If memory does not permit, then you will probably want to store parts of your data to files as you read it in. This will have the same issues as the array/loop organization/cache hit optimization mentioned above, but it will be magnified many times over. At the end of reading in your main data you would want to be able to process all data from a particular temp file (possibly containing all data for a given product (xOR for a given time period)) before moving to the next. The main bad part of trying to do this is that when you are reading in the data it is very likely that you will have to deal with not being able to have every temp file open at the same time. This might require you to come up with a way of doing open file swapping (same as memory swapping, except what you are swapping is open files rather than memory pages). This would be a whole other problem, though.


I suggest you reorder your data according to the most frequently accessed processes. The most frequently accessed data should be easiest and fastest to access.

Also, take a look at Database Normalization. This is a concept of organizing data for least amount of duplication, and also makes accessing the data more efficient.

Another idea is to use indices for less popular data searches.


That would make for a smallish database table. It's about 0.4GB of data if the full matrix of consumers/products/time exist. Have you considered running the whole thing in SQL? Even if you don't us a full up database; for that size of data, it would be practical to generate a full table for each of the possible sort orders and dump each to a file. You could then load whatever file you need to walk it in whatever order you choose.

Also, if you can run the full 10K passes in parallel or at least a few dozen per pass, you might be ahead to do so as it may drastically reduce your IO waits and/or cache misses.

0

精彩评论

暂无评论...
验证码 换一张
取 消