I have millions of "people" records - say the customers for client one and customers for client two. We want to match the people in client one and client two together - for example finding that "Mr Joel Spolsky" is in client one database and matching him to "J Spolsky" in client two, creating a totally new record in "master database".
THe exact algorthim to match two candidates is not important, what is important is that the most obvious solution is to take every record in client one and compare to every record in client two.
This开发者_开发问答 quickly becomes an enormous task, especially with clients three four five etc.Does anyone have any interesting approaches to improve performance?
The only way to avoid O(n^2) (or O(n^m) if there are more then 2 clients) is to sort the databases before searching.
But in order to be able to sort them you will have to come up with some normalized field which will always exactly match clients. (e.g. last word in the name field + postcode and all of that forced to lower case)
If you are able to sort the databases you can get your algorithm down to O(n log n)
The most obvious way is to create a common sort algorithm for all databases. Sort your databases into lists, compare the "top" items in each sorted list, then keep discarding the "earliest" item until you find two items that match. Record the match, discard the matches, and continue.
This works very well if you have, say, two sets of book ISBN numbers to compare to find duplicates between two libraries, but not so well with people's names where names may not be identical (J Smith vs John Smith, eg). You can improve things somewhat by using a sort of KWIC scheme where you make several entries in your sorted list for each DB entry -- eg, one entry for name, one entry for address, one entry for social security number -- whatever criteria you might decide to match on. A Soundex-type translation of the names may also be beneficial.
The matching algorithm is important. If you don't know anything about the matching algorithm you have to compare each against every other one in the other client database and you end up with O(N^2).
it strongly depends on the database. Usually "intersect" is the fastest.
Now, you have a subtle difference between the two names in your database: "Mr Joel Spolsky" and "J Spolsky"
That's mean preprocessing the table. to be sure name matches, and maybe write your own "phonetic" index. that's seems out of topic, but if you have column "name" and "firstname" matching, but not column "prefix", what you do? (by mr and mrs alex jones).
Before you know it you end up with a rules engine, a decision making engine, and an interface for all "manual" cases (who are not auto-merge, or definite not the same), and 3 students merging (or flagging as "notsame") millions of addresses full time.
So before you get there, define what you want to merge exactly, then the algorithm can be chosen easily
精彩评论