I have data in the form of:
ID ATTR
3 10
1 20
1 20
4 30
... ...
Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.
I am looking for:
- The number of unique pairs.
- The number of times a non-unique pair pops up.
So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.
I'm currently using a hash table in my naive approach. I keep a counter of uniq开发者_如何学运维ue pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)
Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)
I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)
A hash table with keys like <id>=<attr>
is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?
精彩评论