Python - need fast algorithm that removes all words in a file that are derivatives of other words

We have a file named wordlist.txt, which contains 1,876 KB worth of alphabetized words, all of which are longer than 4 letters. There is one word per line (a carriage return separates each word), with a break between each new two-letter prefix group (ab, ac, ad, etc.):

 wfile = open("wordlist.txt", "r+")

I want to create a new file that contains only words that are not derivatives of other, smaller words. For example, the wordlist contains words such as ["abuser", "abused", "abusers", "abuse", "abuses", ...]. The new file that is created should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among all those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".

I tried this implementation:

def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            print base
            base=word
    result.append(base)
    return result;


def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")

    for line in wfile:
        wordlist.append(line[:-1])

    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")    
    newfile.write(wordlist)

But it always froze my computer. Any solutions?


I would do something like this:

def bases(words):
    base = next(words)  # the first word in the sorted stream is always a base
    yield base
    for word in words:
        if word and not word.startswith(base):
            # word is not an extension of the current base, so it starts a new group
            yield word
            base = word


def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)  # lazy: one line at a time
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))

This goes through the corncob list of 58,000 words in a fifth of a second on my fairly old laptop. It's old enough to have one gig of memory.

$ time python words.py

real        0m0.233s
user        0m0.180s
sys         0m0.012s

It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
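
For instance, a minimal sketch of that tweak (assuming every line, including the last, ends in a newline; if the last line might be missing one, strip() is the safer choice):

words = (line[:-1] for line in f_in)  # drop just the trailing '\n' instead of calling strip()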

Also note that this relies on your input being sorted and non-empty. That was part of the stated preconditions though so I don't feel too bad about it ;)
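
As a quick sanity check, here is what bases() yields for the sample words from the question (iter() just gives it the generator-style input it expects):

words = iter(["abuse", "abused", "abuser", "abusers", "abuses", "rode", "rodeo"])
print(list(bases(words)))  # ['abuse', 'rode']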


One possible improvement is to load the words into a database and avoid loading the full input file into RAM. Another option is to process the words as you read them from the file and write the results without loading everything into memory.

The following example processes the file as it is read, without pre-loading anything into memory.

def root_words(f, out):
    base = f.readline().strip()   # the first word is the first base
    for line in f:
        word = line.strip()       # drop the trailing newline before comparing
        if not word.startswith(base):
            out.write(base + "\n")
            base = word
    out.write(base + "\n")        # don't forget the last base

def main():
    wfile = open("wordlist.txt", "r")
    newfile = open("newwordlist.txt", "w")
    root_words(wfile, newfile)
    wfile.close()
    newfile.close()

if __name__ == "__main__":
    main()

The memory complexity of this solution is O(1), since the variable base is the only state you need in order to process the file. This works because the file is alphabetically sorted.
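
As a quick check, you can exercise root_words with in-memory file objects instead of real files (a sketch assuming Python 3, where io.StringIO stands in for the word list and the output file):

import io

src = io.StringIO("abuse\nabused\nabuser\nabusers\nabuses\nrode\nrodeo\n")
dst = io.StringIO()
root_words(src, dst)
print(dst.getvalue())  # "abuse\nrode\n" -- only the base words survive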


Since the list is alphabetized, this does the trick (it takes 0.4 seconds on 5 MB of data, so 1.8 MB should not be a problem):

res = [" "]

with open("wordlist.txt","r") as f:
    for line in f:
        tmp = line.strip()
        if tmp.startswith(res[-1]):
            pass
        else:
            res.append(tmp)

with open("newlist.txt","w") as f:
    f.write('\n'.join(res[1:]))