开发者

What is a good strategy to group similar words?

开发者 https://www.devze.com 2023-03-18 13:06 出处:网络
Say I have a list of movie names with misspellings and small variations like this - \"Pirates of the Caribbean: The Curse of the Black Pearl\"

Say I have a list of movie names with misspellings and small variations like this -

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates of the carribean"
 "Pirates of the Caribbean: Dead Man's Chest"
 "Pirates of the Caribbean trilogy"
 "Pirates of the Caribbean"
 "Pirates Of The Carribean"

How do I group or find such 开发者_JAVA技巧sets of words, preferably using python and/or redis?


Have a look at "fuzzy matching". Some great tools in the thread below that calculates similarities between strings.

I'm especially fond of the difflib module

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison


You might notice that similar strings have large common substring, for example:

"Bla bla bLa" and "Bla bla bRa" => common substring is "Bla bla ba" (notice the third word)

To find common substring you may use dynamic programming algorithm. One of algorithms variations is Levenshtein distance (distance between most similar strings is very small, and between more different strings distance is bigger) - http://en.wikipedia.org/wiki/Levenshtein_distance.

Also for quick performance you may try to adapt Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.

So after calculating distance between all your strings, you have to clusterize them. The most simple way is k-means (but it needs you to define number of clusters). If you actually don't know number of clusters, you have to use hierarchical clustering. Note that number of clusters in your situation is number of different movies titles + 1(for totally bad spelled strings).


I believe there is in fact two distinct problems.

The first is spell correction. You can have one in Python here

http://norvig.com/spell-correct.html

The second is more functional. Here is what I'd do after the spell correction. I would make a relation function.

related( sentence1, sentence2 ) if and only if sentence1 and sentence2 have rare common words. By rare, I mean words different than (The, what, is, etc...). You can take a look at the TF/IDF system to determine if two document are related using their words. Just googling a bit I found this:

https://code.google.com/p/tfidf/


To add another tip to Fredrik's answer, you could also get inspired from search engines like code, such as this one :

def dosearch(terms, searchtype, case, adddir, files = []):
    found = []
    if files != None:
        titlesrch = re.compile('>title<.*>/title<')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found

Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/

Regards,

Max


One approach would be to pre-process all the strings before you compare them: convert all to lowercase, standardize whitespace (eg, replace any whitespace with single spaces). If punctuation is not important to your end goal, you can remove all punctuation characters as well.

Levenshtein distance is commonly-used to determine similarity of a string, this should help you group strings which differ by small spelling errors.

0

精彩评论

暂无评论...
验证码 换一张
取 消