Parallelized record combining - matching on multiple keys

https://www.devze.com 2022-12-13 18:44 Source: web
I have been looking at using MapReduce to build a parallelized record combining system. The language doesn't matter, I can use a pre-existing library such as Hadoop or build my own if necessary, I'm not worried about that.

The problem that I keep running into, however, is that I need the records to be matched on multiple criteria. For example: I may need to match the records based on a person's name or the person's phone number, but not necessarily both the name and the phone number.

For instance, given the following keys for each record:

  1. 'John Smith' and '555-555-5555'
  2. 'Jane Smith' and '555-555-5555'
  3. 'John Smith' and '555-555-1111'

I want the system to take all three records, figure out that they match on one of the keys, and combine them into a single combined record that has both names ('John Smith' and 'Jane Smith') as well as both phone numbers ('555-555-5555' and '555-555-1111').

Is this something that I can accomplish using MapReduce? If so, how would I go about matching the keys produced by the Map function so that all of the matched records can be passed into the Reduce function?* Alternatively, is there a different/better way I could be doing this? My only real requirement is that I need it parallelized.

[*] Please note: I am assuming that the Reduce function could be used in such a way that each call to the Reduce function produces a single combined record, rather than the Reduce function producing a single result for the entire job.


You can definitely do this in the map/reduce paradigm.

Let's say that you're matching on anything containing "smith" or phone numbers starting with "555". You would canonicalize your search string into "smith|^555", for example. In the Map phase, you would do:

  • John Smith / 555-555-5555 K: smith|^555, V = (John Smith,555-555-5555)
  • Jane Smith / 555-555-5555 K: smith|^555, V = (Jane Smith,555-555-5555)
  • John Smith / 555-555-1111 K: smith|^555, V = (John Smith,555-555-1111)

Since you've given them all the same key ("smith|^555") they will all be handed off to the same reducer instance, which would now get, as input:

K: smith|^555, V: [(John Smith,555-555-5555),(Jane Smith,555-555-5555),(John Smith,555-555-1111)]

Now, in your reducer step, you can instantiate one hashset for names and another for numbers. When you're done processing the array of values, output all the entries from the names hashset and all the entries from the numbers hashset as a single combined record.
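The approach above can be sketched end to end in a few lines of Python. This is a minimal single-process illustration of the map/shuffle/reduce flow, not Hadoop code; the hard-coded key "smith|^555" stands in for whatever canonicalized criterion string you actually build.

```python
from collections import defaultdict

def map_record(record):
    """Emit the same canonical key for every record that matches the
    search criteria, so they all reach the same reducer instance."""
    name, phone = record
    return ("smith|^555", (name, phone))

def reduce_records(key, values):
    """Collect the distinct names and distinct numbers from all matched
    records and emit one combined record."""
    names, phones = set(), set()
    for name, phone in values:
        names.add(name)
        phones.add(phone)
    return (sorted(names), sorted(phones))

records = [("John Smith", "555-555-5555"),
           ("Jane Smith", "555-555-5555"),
           ("John Smith", "555-555-1111")]

# Shuffle step: group the mapped values by key.
groups = defaultdict(list)
for rec in records:
    k, v = map_record(rec)
    groups[k].append(v)

combined = {k: reduce_records(k, vs) for k, vs in groups.items()}
print(combined["smith|^555"])
# (['Jane Smith', 'John Smith'], ['555-555-1111', '555-555-5555'])
```

Since every matching record gets the identical key, the framework's shuffle guarantees they land in one reduce call, which produces exactly one combined record per query.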


I don't think Map is useful here, because you can't really create a meaningful key for each record that will help identify the groupings of records.

It is not possible to implement this using Reduce either. Consider the example you yourself gave: if you query for 'Jane Smith', you cannot tell at that point that the first record is related to the query, and so you will ignore it. In fact, you could end up chaining names and numbers together until you've got every record in the file. The only way to pick up all the matches is to iteratively scan over the list until you stop finding new links.

This is very easy to parallelize, though: you can share out the records amongst some number of threads, and each can search its own records for new links. I'd suggest treating these sets as rings of data, so that you can record the point at which you were searching with the most up-to-date information, and you know you're finished once all threads have completed a full loop without finding anything new.
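The iterative link-chasing described above is equivalent to computing connected components over the key values, which a union-find structure handles without repeated scans. This is my own sketch of that alternative, not code from either answer: records that share any key value end up in the same group.

```python
from collections import defaultdict

def find(parent, x):
    """Follow parent links to the root, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the components containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb

records = [("John Smith", "555-555-5555"),
           ("Jane Smith", "555-555-5555"),
           ("John Smith", "555-555-1111")]

# Every distinct key value (name or number) is a node; each record
# links its own key values together.
parent = {}
for name, phone in records:
    parent.setdefault(name, name)
    parent.setdefault(phone, phone)
    union(parent, name, phone)

# Group records by the root of any one of their key values.
groups = defaultdict(list)
for rec in records:
    groups[find(parent, rec[0])].append(rec)

for members in groups.values():
    names = sorted({n for n, _ in members})
    phones = sorted({p for _, p in members})
    print(names, phones)
```

All three example records fall into a single group, giving the combined record the asker wanted. The union-find passes are also parallelizable: partition the records across workers, build local structures, then merge the per-worker parents in a final step.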

