Efficient set operations in mapreduce

I have inherited a mapreduce codebase which mainly calculates the number of unique user IDs seen over time for different ads. To me it doesn't look like it is being done very efficiently, and I would like to know if anyone has any tips or suggestions on how to do this kind of calculation as efficiently as possible in mapreduce.

We use Hadoop, but I'll give an example in pseudocode, without all the cruft:

map(key, value):
  ad_id = .. // extract from value
  user_id = ... // extract from value
  collect(ad_id, user_id)

reduce(ad_id, user_ids):
  unique_user_ids = new Set()
  foreach (user_id in user_ids):
    unique_user_ids.add(user_id)
  collect(ad_id, unique_user_ids.size)

It's not much code, and it's not very hard to understand, but it's not very efficient. Every day we get more data, and so every day we need to look at all the ad impressions from the beginning to calculate the number of unique user IDs for that ad, so each day it takes longer, and uses more memory. Moreover, without having actually profiled the code (not sure how to do that in Hadoop) I'm pretty certain that almost all of the work is in creating the set of unique IDs. It eats enormous amounts of memory too.

I've experimented with non-mapreduce solutions, and have gotten much better performance (but the question there is how to scale it in the same way that I can scale with Hadoop), but it feels like there should be a better way of doing it in mapreduce than the code I have. It must be a common enough problem for others to have solved.

How do you implement the counting of unique IDs in an efficient manner using mapreduce?


The problem is that the code you inherited was written with the mindset "I'll determine the unique set myself" instead of the "let's leverage the framework to do it for me".

I would do something like this (pseudocode) instead:

// Phase 1: deduplicate (ad_id, user_id) pairs by making the pair the key
map(key, value):
  ad_id = ... // extract from value
  user_id = ... // extract from value
  collect((ad_id, user_id), dummy) // value is unused

reduce((ad_id, user_id), dummies):
  collect(ad_id, 1) // exactly one output per unique (ad_id, user_id) pair

// Phase 2: sum the 1s per ad_id
map(ad_id, one): // identity mapper
  collect(ad_id, one)

reduce(ad_id, ones):
  unique_user_ids = sum(ones)
  collect(ad_id, unique_user_ids)
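
In case it helps, here is a rough sketch of what those two phases might look like with Hadoop's Java API. The class names and the tab-separated "ad_id<TAB>user_id" input format are my own assumptions, not from the question; adapt the parsing to your actual records.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueUserCount {

  // Phase 1 mapper: make the (ad_id, user_id) pair the key so the
  // framework's shuffle/sort does the deduplication for us.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumes one "ad_id<TAB>user_id" record per line.
      String[] fields = line.toString().split("\t", 2);
      pair.set(fields[0] + "\t" + fields[1]);
      context.write(pair, NullWritable.get());
    }
  }

  // Phase 1 reducer: each incoming key is a distinct (ad_id, user_id)
  // pair, so emit (ad_id, 1) exactly once per pair.
  public static class PairReducer
      extends Reducer<Text, NullWritable, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text adId = new Text();

    @Override
    protected void reduce(Text pair, Iterable<NullWritable> dummies,
        Context context) throws IOException, InterruptedException {
      adId.set(pair.toString().split("\t", 2)[0]);
      context.write(adId, ONE);
    }
  }

  // Phase 2 mapper: re-parse phase 1's "ad_id<TAB>1" text output.
  public static class CountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text adId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      adId.set(line.toString().split("\t", 2)[0]);
      context.write(adId, ONE);
    }
  }

  // Phase 2 reducer: sum the 1s per ad_id.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text adId, Iterable<IntWritable> ones,
        Context context) throws IOException, InterruptedException {
      int uniqueUsers = 0;
      for (IntWritable one : ones) {
        uniqueUsers += one.get();
      }
      context.write(adId, new IntWritable(uniqueUsers));
    }
  }
}

You would chain these as two jobs, pointing the second job's input at the first job's output directory. Setting SumReducer as the combiner on the second job keeps the shuffle small, since the 1s get pre-summed on the map side.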


Niels' solution is good, but for an approximate alternative that is closer to the original code and uses only a single mapreduce phase, just replace the set with a Bloom filter. Membership queries in a Bloom filter have a small probability of false positives, but the size estimate is very accurate.
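
For concreteness, here is a minimal, self-contained sketch of such a Bloom filter with a cardinality estimate derived from its fill ratio (n ≈ -(m/k) · ln(1 - X/m), where X is the number of set bits). The sizing and hashing choices below are illustrative assumptions; in practice you would size m for the expected number of users per ad.

import java.util.BitSet;

public class BloomCounter {
  private final int m;      // number of bits in the filter
  private final int k;      // number of hash functions
  private final BitSet bits;

  public BloomCounter(int numBits, int numHashes) {
    this.m = numBits;
    this.k = numHashes;
    this.bits = new BitSet(numBits);
  }

  // Insert via double hashing: h_i(x) = h1(x) + i * h2(x).
  public void add(String userId) {
    int h1 = userId.hashCode();
    int h2 = mix(userId); // second, roughly independent hash
    for (int i = 0; i < k; i++) {
      bits.set(Math.floorMod(h1 + i * h2, m));
    }
  }

  // Estimate the number of distinct insertions from the fill ratio:
  // n ≈ -(m/k) * ln(1 - X/m), where X = number of set bits.
  public long estimateCardinality() {
    int setBits = bits.cardinality();
    if (setBits == m) return Long.MAX_VALUE; // saturated: filter undersized
    return Math.round(-((double) m / k) * Math.log(1.0 - (double) setBits / m));
  }

  // Cheap second hash; any decent hash independent of hashCode() will do.
  private static int mix(String s) {
    int h = 0x9747b28c;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x5bd1e995;
      h ^= h >>> 15;
    }
    return h;
  }
}

The original reducer would then call add() for each user_id and emit estimateCardinality() instead of a set size. The key win is that memory is fixed at m bits per ad regardless of how many user IDs stream through, whereas the Set grows with the number of unique users.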
