开发者

Python- Code Optimization Help- Find all dictionary-valid anagrams of a word

开发者 https://www.devze.com 2023-03-07 22:44 出处:网络
I solved the problem using this (horribly inefficient method): def createList(word, wordList): #Made a set, because for some reason permutations was returning duplicates.

I solved the problem using this (horribly inefficient method):

def createList(word, wordList):
    #Made a set, because for some reason permutations was returning duplicates.  
    #Returns all permutations if they're in the wordList

    return set([''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList])

def main():

    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    length = 0
    maxwords = ""

    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwo开发者_开发技巧rds = permList
    print (maximum, maxwords)

It took something like 10 minutes to find the five-letter word that has the most dictionary-valid anagrams. I want to run this on words without a letter constraint, but it would take a ludicrous amount of time. Is there anyway to optimize this?


This following seems to work OK on a smallish dictionary. By sorting the letters in the word, it becomes easy to test if two words are an anagram. From that starting point, it's just a matter of accumulating the words in some way. It wouldn't be hard to modify this to report all matches, rather than just the first one

If you do need to add constraints on the number of letters, the use of iterators is a convenient way to filter out some words.

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename,'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    import collections
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[ hash(sortedWord) ].append(word)
    maxKey = max( d.keys(), key = lambda k : len(d[k]) )
    return d[maxKey]

iter = wordIterator( 'C:\\Python32\\megalist.txt' )
#iter = ( word for word in iter if len(word) == 5 )
print largestAnagram(iter)

Edit:

In response to the comment, the hash(sortedWord), is a space saving optimization, possibly premature in this case, to reduce sortedWord back to an integer, because we don't really care about the key, so long as we can always uniquely recover all the relevant anagrams. It would have been equally valid to just use sortedWord as the key.

The key keyword argument to max lets you find the maximum element in a collection based on a predicate. So the statement maxKey = max( d.keys(), key = lambda k : len(d[k]) ) is a succinct python expression for answering the query, given the keys in the dictionary, which key has the associated value with maximum length?. That call to max in that context could have been written (much more verbosely) as valueWithMaximumLength(d) where that function was defined as:

def valueWithMaximumLength( dictionary ):
    maxKey = None
    for k, v in dictionary.items():
        if not maxKey or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey


wordList should be a set.

Testing membership in a list requires you to scan through all the elements of the list checking whether they are the word you have generated. Testing membership in a set does not (on average) depend on the size of the set.

The next obvious optimisation is that once you have tested a word you can remove all its permutations from wordList, since they will give exactly the same set in createList. This is a very easy operation if everything is done with sets -- indeed, you simply use a binary minus.


There is no need to do ALL permutations, - it's a waste of memory and CPU. So, first of all your dictionary should be kept in a binary tree, like this:

   e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']

   Step 1: find the "middle" word for the dictionary, e.g. "mother", because "m" 
           is somewhere in the middle of the English alphabet 
           (this step is not necessary, but it helps to balance the tree a bit)

   Step 2: Build the binary tree:

               mother 
              /      \
             /        \
         alex          noise
           \            /  \
            \          /    \
           apple   question xeal
             \
              \
             google

  Step 3: start looking for an anagram by permutations:
  alex: 1. "a"... (going deeper into binary tree, we find a word 'apple')
        1.1 # of, course we should omit apple, because len('apple') != len('alex')
            #  but that's enough for an example:
        2. Check if we can build word "pple" from "lex": ("a" is already reserved!)
           -- there is no "p" in "lex", so skipping, GOTO 1            
        ... 
        1. "l"... - nothing begins with "l"...
        1. "l"... - nothing begins with "e"...
        1. "x" - going deeper, "xeal" found
        2. Check if "eal" could be built from "ale" ("x" is already reserved)
           for letter in "eal":
               if not letter in "ale":
                  return False
           return True

That's it :) This algorithm should work much faster.

EDIT:

Check bintrees package to avoid spending time on binary tree implementation.

0

精彩评论

暂无评论...
验证码 换一张
取 消