I'm converting several large ETL processes from SSIS over to Rhino-Etl. I got fed up with the "grahical programming" nature of SSIS. Hunting through hundreds of dialog boxes to figure out what the ETL process is doing gets old in a hurry.
Several of the SSIS packages use the Fuzzy Match component for joining user accounts based on first name, last name, and DOB. The fuzzy match is necessary because the names are not always spelled the same across the two systems. There might be a legal name in one and a nick name in the other, typos, marriage name changes, etc.
The SSIS fuzzy lookup component does an OK job of handling these. It's not perfect, but it still greatly reduces the number of records that require a human's help.
Neither .Net nor Rhino-Etl have anything like this out of the box, so I'm looking for suggestions.
I have tried using the SoundEx t-sql function, but it does开发者_JAVA技巧n't produce very good results.
I'm currently considering using Levenshtein or possibly a Lucene.Net index.
Does anyone have any advice on either of those? Please feel free to suggest something completely different as well.
Have a look at open source project Sam's String Metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html.
I answered a similar question here: Are there any Fuzzy Search or String Similarity Functions libraries written for C#?
I would also do some research on Double Metaphone http://en.wikipedia.org/wiki/Double_Metaphone
We use Levenshtein to find similar strings in our application code (actually a database that contains all the strings). It works quite good, provided that you play a bit with the distance, e.g. allow longer strings to have more differences.
What Levenshtein doesn't do is handle synonyms, like "child" and "kid" or "car" and "wagon". In your case you probably also want to handle synonyms like "Bill" and "William", so you will probably have to add this yourself.
精彩评论