Reducing the granularity of a data set

I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:

#    Students    Year    Subject    Teacher
1    30          7       Math       Mrs Smith
2    28          7       Math       Mr Cork
3    20          8       Math       Mrs Smith
4    20          8       English    Mr White
5    18          8       English    Mr Book
6    10          12      Math       Mrs Jones

Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:

#    Students    Year    Subject
1    58          7       Math
2    20          8       Math 
3    38          8       English
4    10          12      Math

My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.

My initial (perhaps naive) thought would be to do something along the following lines:

Until I exhaust the list of records:

  • Each 'unique' record that I come across is added as a key to a hashmap.
  • If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one.
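
A minimal sketch of that approach, assuming (purely for illustration) that each cache entry exposes students, year, subject and teacher fields, and using an immutable List of the grouped values as a composite key:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollUpSketch {
    // Hypothetical stand-in for whatever object the cache actually returns.
    record CacheRecord(int students, int year, String subject, String teacher) {}

    // Aggregate Students by (Year, Subject), dropping the Teacher dimension.
    static Map<List<Object>, Integer> rollUp(Iterable<CacheRecord> records) {
        Map<List<Object>, Integer> totals = new HashMap<>();
        for (CacheRecord r : records) {
            // The key keeps only the fields of the coarser aggregation level.
            List<Object> key = List.of(r.year(), r.subject());
            // merge() inserts the quantity for an unseen key, or adds it to
            // the running total for a key we've already encountered.
            totals.merge(key, r.students(), Integer::sum);
        }
        return totals;
    }
}
```

Run over the six rows above, this yields [7, Math]=58, [8, Math]=20, [8, English]=38 and [12, Math]=10, matching the rolled-up table. It's a single O(n) pass with one hash lookup per record.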

Now for all I know this is a fairly common problem and there are much better ways of doing it, so I'd welcome any feedback as to whether I'm pointing myself in the right direction.

"Get a new cache" not an option I'm afraid :)

-Dave.


Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.

Of course, you would have to build the index when populating the cache and maintain it as data is modified.
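
As a hypothetical sketch of that idea (the onInsert/onRemove hooks are invented names for wherever the cache notifies you of changes), the index can simply be a map of running totals kept in sync as records come and go, so a roll-up query becomes a single lookup instead of a scan:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class YearSubjectIndex {
    private final Map<List<Object>, Integer> totals = new HashMap<>();

    // Call whenever a record is added to the cache.
    void onInsert(int students, int year, String subject) {
        totals.merge(List.of(year, subject), students, Integer::sum);
    }

    // Call whenever a record is removed; merge() drops the key entirely
    // once its total reaches zero (a null result removes the mapping).
    void onRemove(int students, int year, String subject) {
        totals.merge(List.of(year, subject), -students,
                (a, b) -> a + b == 0 ? null : a + b);
    }

    // An update is an onRemove of the old values plus an onInsert of the new.
    int studentsFor(int year, String subject) {
        return totals.getOrDefault(List.of(year, subject), 0);
    }
}
```

The read-time scan disappears for this one (year, subject) view; the cost moves to write time, which is the trade-off noted above.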
