How to solve the dilemma of storing human names in MySQL while keeping both discriminability and a search for similar names?

Developer https://www.devze.com 2023-03-23 22:03 Source: Internet

I was given the beautiful task ;-) of designing some tables in a MySQL database that should hold human names.

Criteria:

  1. I have only the full names. (There is no separation into e.g. first name, surname and so on.)
  2. The storage should be diacritic-sensitive and case-sensitive. (The following names stand for different persons.)

    • "Voss" and "Voß".
    • "Joel" and "Joël".
    • "franc" and "Franc" and "Fránc".
  3. A search should return all names similar to the search string. E.g. a search for "franc" should return ["franc", "Franc", "Fránc"] and so on. (It would be awesome if the search returned not only the diacritic-insensitive matches but also similar-sounding names, or names that partially match the search string.)

I thought of using the collation utf8_bin for the column (declared as unique) in which I will store the names. This would satisfy point 2, but it would hurt point 3. Declaring the name column as unique with the collation utf8_unicode_ci satisfies point 3, but it hurts point 2.

So my question is: Is there a way to solve this task while respecting all criteria? And since I don't want to reinvent the wheel: Is there an elegant way to handle human names (and searches for them) in databases? (Sadly, I do not have the possibility of splitting the names into first names, surnames and optional middle names.)

Edit:

The number of names is around a million (~1,000,000 entries). And if it matters: I am using Python as the scripting language to populate the database and to query the data later on.


It is useful to decompose the full name into component "name words" and store a phonetic encoding (Metaphone or one of the many other choices) for each of them. You only need the notion of name words, without categorizing each one as a first, middle or last name, which is fine because those categories don't work well across cultures anyway. You can still use positional order later for ranking, so that a search for "Paul Carl" matches "Paul Karl" better than it matches "Carl Paul". Be aware of ambiguous punctuation, which may require storing multiple versions of some name words. For instance, Bre-Anna Heim would be broken into the name words "bre", "anna", "breanna" and "heim". Sometimes the hyphen is irrelevant, as in "Bre-Anna", and sometimes it is not, as in "Sally-June": Bre-Anna never goes by just Bre or Anna, but Sally-June may sometimes go by just Sally or just June. It's hard to know which case applies, so cover both possibilities.
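As a minimal sketch of the decomposition described above, the following Python splits a full name into name words (including the joined variant of hyphenated words) and attaches a phonetic key to each. The helper names `fold`, `name_words`, `phonetic_key` and `encode_name` are hypothetical, and the key is a deliberately simplified Soundex-style code, not Metaphone; a real Metaphone or Double Metaphone implementation could be swapped in for `phonetic_key`.

```python
import unicodedata

# Soundex-style consonant groups; vowels and h/w/y get no digit.
_CODES = {}
for _digit, _letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                         ("4", "l"), ("5", "mn"), ("6", "r")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def fold(word):
    """Case-fold and strip diacritics, e.g. 'Fránc' -> 'franc', 'Voß' -> 'voss'."""
    decomposed = unicodedata.normalize("NFKD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_words(full_name):
    """Split a full name into folded name words, covering both readings of a hyphen."""
    words = []
    for token in full_name.split():
        parts = [fold(part) for part in token.split("-") if part]
        words.extend(parts)
        if len(parts) > 1:
            words.append("".join(parts))  # joined variant, e.g. "breanna"
    return words


def phonetic_key(word):
    """Simplified Soundex-style key (assumption: NOT Metaphone).

    Unlike classic Soundex, the first letter is encoded as a digit too,
    so that 'carl' and 'karl' receive the same key.
    """
    digits = []
    prev = ""
    for ch in fold(word):
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            digits.append(digit)
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return ("".join(digits) + "000")[:4]


def encode_name(full_name):
    """Return (position, name word, phonetic key) rows, one per name word."""
    return [(pos, word, phonetic_key(word))
            for pos, word in enumerate(name_words(full_name))]


if __name__ == "__main__":
    for row in encode_name("Bre-Anna Heim"):
        print(row)
```

Each row could be stored in a separate table keyed by the id of the full name, with an index on the phonetic key. The position is kept so that it can be used for ranking later.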

You can write your query against this by similarly decomposing and phonetically encoding the full name you're searching for. Your query can return, say, those full names that have two or more phonetic matches among their component name words (or one, if there is only one name word in the search or in the source). This gives you a subset of full names to consider further. You could come up with a simple ranking of them, or even run a distance-matching algorithm on this subset, which would be too expensive computationally to run against the entire million names. By distance matching, I mean edit-distance algorithms such as Levenshtein distance.
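The two-stage query described above might look roughly like the sketch below. It assumes the phonetic codes of every stored name are already available as a mapping from full name to a set of codes. The `index` structure, the function names and the example codes are all hypothetical, and in practice the first stage would be a SQL query against an indexed code column instead of a Python loop.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def search(query, query_codes, index, min_shared=2):
    """Return candidate full names, closest to the query first.

    index maps full name -> set of phonetic codes of its name words
    (hypothetical structure standing in for a database table).
    """
    candidates = []
    for name, codes in index.items():
        # Require two shared codes, or one if either side has only one name word.
        needed = max(1, min(min_shared, len(query_codes), len(codes)))
        if len(query_codes & codes) >= needed:
            candidates.append(name)
    # The expensive distance is computed only on the small candidate subset.
    return sorted(candidates,
                  key=lambda name: levenshtein(query.lower(), name.lower()))


if __name__ == "__main__":
    # Example codes are made up for illustration.
    index = {
        "Paul Karl": {"1400", "2640"},
        "Carl Paul": {"2640", "1400"},
        "Paula Kerr": {"1400", "2600"},
        "Peter Karlsson": {"1360", "2642"},
    }
    print(search("Paul Carl", {"1400", "2640"}, index))
```

Because the edit distance is computed on the whole string, word order influences the ranking, so "Paul Karl" sorts ahead of "Carl Paul" for the query "Paul Carl".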

(edit) The reasoning for this is handling cases like the following name: Maria de los Angeles Gomez-Rodriguez. One data entry person may just enter Maria Gomez. Another might enter Maria Gomez Rodriguez. Yet another might enter Maria Angeles Rodrigus.


You can store the output of an algorithm like Metaphone (or Double Metaphone) in another column so that you can find names that are "similar" to each other. You will have to look for an international version that knows about the German eszett character (ß).
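To illustrate the "another column" idea, here is a small sketch that keeps the exact name in a unique, byte-compared column and puts a derived search key in a second, indexed column. It uses Python's built-in `sqlite3` purely as a runnable stand-in for MySQL, and the table and column names are made up. In MySQL, the exact column would get a binary collation such as the utf8_bin proposed in the question. The key here is only a case-folded, diacritic-stripped form; a phonetic code could be stored in the same way.

```python
import sqlite3
import unicodedata


def search_key(name):
    """Case-fold and strip diacritics; str.casefold() also maps 'ß' to 'ss'."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Stand-in for MySQL: SQLite compares TEXT byte-wise by default, so the
# UNIQUE constraint treats "franc", "Franc" and "Fránc" as different names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT NOT NULL UNIQUE, name_key TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_person_name_key ON person (name_key)")

names = ["Voss", "Voß", "Joel", "Joël", "franc", "Franc", "Fránc"]
conn.executemany(
    "INSERT INTO person (name, name_key) VALUES (?, ?)",
    [(name, search_key(name)) for name in names],
)


def find_similar(connection, query):
    """Look up all stored names whose search key equals the query's key."""
    rows = connection.execute(
        "SELECT name FROM person WHERE name_key = ?", (search_key(query),)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(find_similar(conn, "franc"))
    print(find_similar(conn, "voss"))
```

With this layout all seven names are stored as distinct rows, while a search for "franc" finds its three variants and a search for "voss" finds both "Voss" and "Voß".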
