What logic to use to rollup/merge multiple person entities as the same? (tight, but fuzzy enough to broaden matches)_问答_开发者

What logic to use to rollup/merge multiple person entities as the same? (tight, but fuzzy enough to broaden matches)

开发者 https://www.devze.com 2023-01-20 16:18 出处：网络

I have multiple instances of people entities which are often times the same person. Where the address First-Last is the same at the same address, it\'s a no-brainer to merge/rollup them.

I have multiple instances of people entities which are often times the same person. Where the address First-Last is the same at the same address, it's a no-brainer to merge/rollup them.

However, due to data entry inconsistencies, there must be a way to deviate a bit from the exactness. I think the credit card ind开发者_运维技巧ustry does this a little bit: zip plus street number, or street name? ...something of that nature.

In order to solidify my matching, I cleaned up the address strings, trying to make them as standard as possible ("Hwy" --> "Highway", etc.).

I need something that still will make matches on records that would look obvious to a person just by glancing at them, but fails to have exactly matching data.

Here is my initial thought, concatenate a string made up of the following:

First Initial
LEFT8 of the LastName (allows inconsistent endings, such as "Esq." or "CPA")
LEFT3 of Zip
Street Number
LEFT8 of the StreetName (not Addr1 -- "Oak" for "8 N Oak Street")

Did I miss something here? I think I made it loose enough to overcome normal data entry inconsistencies, but tight enough to avoid incorrect matches.

I was involved in a project to clean up name and address data for a large financial institution. We achieved a success rate automatically of about 98.4% but unfortunately this still left about 150,000 mismatches.

The way we attacked the problem was to (over time) build up a rule base of the types of errors that could occur, and extending the fuzziness of the logic to cover identified classes of error.

A significant amount of data cleansing can indeed be done by reference to (UK) post codes and house number and/or name. In the UK fuzziness can be introduced by consideration of the first part of the post code - which determines a wide area. I'm not clear whether the same applies to zip codes.

However this approach does not deal well with addresses that are out of the normal run - my own address is an example; I live on a boat, and as a consequence have some additional pieces of address in order to ensure correct addressing.

Anomalies of this sort are always likely to need manual intervention.

Incidentally, your assertion that it's a no-brainer to merge/rollup people whose First-Last is the same at the same address needs to be challenged. The most difficult cases we had in data cleansing were precisely where there were two people (eg father and son) of the same name living at the same address. Equally, if somebody of the same name bought a property (which happens) then again there are problems of "re-duplication".

Chris A., have you considered employing official expert systems at this task? Remarkably, as you're finding, standardizing addresses so you can iterate through them effectively gets very difficult very fast. At SmartyStreets (where I work), that's our business core: the implementation of certain algorithms which do this task.

This may not a direct answer to your exact question, but it's a vital step along the way that, in developing a fuzzy search query, you have good data to begin with. In other words, as Chris W. has shown in his answer, even after a fuzzy query, there's much left to be desired.

So I'd suggest first truly standardizing all the addresses (accounting for address "overloads" per-se, two addresses looking totally different, but are the same address). For US-based addresses, you could try a list processing service (like CASS-Certified Scrubbing; research for your own choice). A good one will flag duplicates for you, then let you take action. After the addresses have been normalized and flagged, you can much more quickly weed out the exact duplicates based on your business' definition (by family name, etc). At that point you'd run your fuzzy search against anything except addresses which are the trickiest, and you already have a good idea about what's a duplicate may be.