开发者

Counting number of unique pairs and instances of non-unique pairs in unsorted data

开发者 https://www.devze.com 2023-04-07 04:34 出处:网络
I have data in the form of: IDATTR 310 120 120 430 ...... Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may b

I have data in the form of:

ID   ATTR
3    10
1    20
1    20
4    30
...  ...

Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.

I am looking for:

  1. The number of unique pairs.
  2. The number of times a non-unique pair pops up.

So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.

I'm currently using a hash table in my naive approach. I keep a counter of uniq开发者_如何学运维ue pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)

Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)

I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)


A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?

0

精彩评论

暂无评论...
验证码 换一张
取 消