Data matching algorithm_问答_开发者_运维开发者技术经验分享

I am currently working on a project where I a data matching algorithm needs to be implemented. An external system passes in all data it knows about a customer, and the system I design has to return the customer matched. So the external system then knows the correct id of the customer plus it gets additional data or can update its own data of the specific customer.

The following fields are passed in:

Name
Name2
Street
City
ZipCode
BankAccountNumber
BankName
BankCode
Email
Phone
Fax
Web

The data can be of high quality and alot of information is available, but often the data is crappy and just the name and address is available and might have spellings.

I'm implementing the project in .Net. What I currently do is something like the following:

public bool IsMatch(Customer customer)
{
    // CanIdentify just checks if the info is provided and has a specific length (e.g. > 1)
    if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
    {
        // some parsing of strings done before (substring, etc.)
        if(Street == customer.Street && AccountNumber == customer.BankAccountNumber) return true;
    }
    if (CanIdentifyByStreet() && CanIdentifyByZipCode() &&CanIdentifyByName())
    {
        ...
    }
}

I am not very happy with the approach above. This is because I would have to write if statements for all reasonable cases (combinations) so I don't miss any chance of matching the entity.

So I thought maybe I could create some kind of matching score. So for each criteria matched, a score would be added. Like:

public bool IsMatch(Customer customer)
{
    int matchingScore = 0;
    if (CanIdentifyByStreet())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByName())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyBankAccountNumber())
    {
        if(....)
            matchingScore += 10;
    }

    if(matchingScore > iDontKnow)
        return true;
}

This would allow me to take in consideration all matching data开发者_Go百科, and depending on some weight I would increase the matching score. If the score is high enough, it's a match.

Know my question is: Are there any best practices out there for such things, like matching algorithm patterns etc? Thanks alot!

For inspiration, look at the Levenshtein distance algorithm. This will give you a reasonable mechanism to weight your comparisons.

I would also add that in my experience you can never match two arbitrary pieces of data into the same entity with absolute certainty. You need to present plausible matches to a user, who can then verify for sure that John Smith on 1920 E. Pine is the same person as Jon Smith on 192 East Pine Road or not.

In my experience with this sort of thing, it was actually the business people who defined the rules of what was acceptible as a match, rather than it being a technical decision. This has made sense to me, since the business ends up assuming the risk. Also, what constitutes a match can be prone to change, like if they use the system and find that too many people are being excluded.

I think that your first approach makes more sense, in that if you can match someone by name and bank account number, then you're pretty sure it's them. However, if both the name and bank info don't match, but the address, phone, and all that matches (ie. a spouse) then the scoring system might incorrectly match people. I realize it's a lot of code, but so long as you extract out the actual matching code (matchPhoneNumber method, etc), then it's fine design-wise.

I would probably take it a step further and pull out the matching into an enum and then have lists of acceptable matches. Sort of like this: interface Match { boolean matches(Customer c1, Customer c2); }

class BankAccountMatch implements Match
{
    public boolean matches(Customer c1, Customer c2)
    {
        return c1.getBankAccountNumber() == c2.getBankAccountNumber();
    }
}

static Match BANK_ACCOUNT_MATCH = new BankAccountMatch();

Match[][] validMatches = new Match[] [] {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH, FAX_MATCH}, ...
};

And then the code that does the validation would just iterate over the validMatches array and test them to see if one fits. I might even pull out the lists of valid matches into a config file. That all depends on the level of robustness your system needs though.

If you limit yourself to the address and name you can just use the harvesine formula or a spatial index if you have the geolocation. For the name you can use a trie and get only the first results, maybe 10.

What about a machine learning approach. Create. Distances per item.

These become your input space. Build a training set on correctly matched custers based on these distances. Run through your favourite machine learner algo. Get your parameters for decision func which reflect strength of match. Tune. Apply to new cases. Go to the bank.