I have an application that will store and track visitors. These visitors are created in the system by schedulers (users) as needed when they set up a visit. The problem is that most of the time the only fields that identify a visitor are as follows:
- First Name
- Last Name
- Company Name
The risk of duplicate records existing for the same person is inherent: a scheduler may enter a new visitor record instead of searching the system for an existing person by that name.
When I encounter somebody entering a visitor by the same name I display a warning dialog with various suggestions of who this person COULD be, but then even that is not good enough.
I could enter 'Jim Jones' and this person may exist in the system as 'James Jones' or 'Jimmy Jones'. I see there are name recognition software packages available but they are expensive and certainly heavier than what I am looking for.
Would anybody know where to find a free or open source dictionary file that I can programmatically access to find potential name variants? Software or an online service would be nice but even just a data dump or simple text file might do.
I know even this will not prevent duplicate visitor records. I am just trying to keep them to a minimum, so this is not a critical feature.
Check out the Moby project (http://icon.shef.ac.uk/Moby/mwords.html) for lists of common first and last names. You can precompute groups of similar names using phonetic algorithms like Metaphone and Soundex and use them to identify potential matches. You also mention company names, which are a bit harder to manage since they can be made up of lots of things. For those, maybe check out the 12-dicts word list (http://wordlist.sourceforge.net/): the 2+2lemma list provided in that package groups multiple forms that share a common root, which can be used in conjunction with a similar-spelling solution to provide improved results.
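As a rough illustration of the precomputation idea, here is a minimal Python sketch that groups names by their Soundex code and looks up candidates for a newly entered name. The `known_names` list is a made-up stand-in for a real name file such as the Moby lists, the function names are only for illustration, and the hand-written `soundex` follows the common American Soundex rules rather than any particular library.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of name ('' if it has no a-z letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def build_index(names):
    """Precompute a mapping from Soundex code to the set of names sharing it."""
    index = defaultdict(set)
    for name in names:
        index[soundex(name)].add(name)
    return index


def similar_names(index, query):
    """Return known names with the same Soundex code as query, sorted."""
    return sorted(index.get(soundex(query), set()) - {query})


# Hypothetical sample data; in practice this would be loaded from a name list.
known_names = ["Jon", "John", "Jimmy", "James", "Smith", "Smyth", "Jones"]
index = build_index(known_names)

print(similar_names(index, "Smith"))  # ['Smyth']
print(similar_names(index, "Jim"))    # ['Jimmy', 'John', 'Jon']
```

Note that "Jim" codes to J500 while "James" codes to J520, so a phonetic key on its own does not link that pair; it mostly catches spelling variants that sound alike.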