Project Gutenberg Python problem?

Developer (https://www.devze.com), 2023-01-12 14:08. Source: Internet
I am trying to process various texts with regular expressions and Python's NLTK (described at http://www.nltk.org/book). I want to build a random text generator, and I am stuck on one problem. First, here is my algorithm:

  1. Enter a sentence as input (this is called the trigger string).

  2. Get the longest word in the trigger string.

  3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of uppercase or lowercase.

  4. Return the longest sentence that contains the word from step 3.

  5. Append the sentence from step 4 to the sentence from step 1.

  6. Repeat the process: take the longest word in the second sentence, search again, and so on.

So far I have been able to do this for the first two sentences, but I cannot perform a case-insensitive search. The entire sentence database of Project Gutenberg is available via the gutenberg.sents() function, but a case-insensitive regex search is practically impossible because gutenberg.sents() returns the sentences of each book in a list-of-lists format, one list of tokens per sentence, as follows:

Example: all the sentences of Shakespeare's Macbeth can be retrieved by typing

import nltk
from nltk.corpus import gutenberg
gutenberg.sents('shakespeare-macbeth.txt')

into the Python shell; the output is:

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], 
['Actus', 'Primus', '.'], .......] 

with [The Tragedie of Macbeth by William Shakespeare, 1603] and Actus Primus. being the first two sentences.

How can I find the word I'm looking for regardless of whether it is uppercase or lowercase? I'm desperately in need of help, since I have been tinkering with this for the past two days and it's starting to wear on my nerves. Thanks a lot.


Given a list L of words, and a target word t,

any(t.lower()==w.lower() for w in L)

tells you whether L has word t in a case-insensitive way. It's faster, of course, to do

lt = t.lower()
any(lt==w.lower() for w in L)

since Python does not "hoist" the constant computation out of the loop and, unless you hoist it yourself, it will be performed repeatedly.
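As a small illustration, the hoisted version can be wrapped in a helper and tried on the first Macbeth sentence shown in the question. The function name contains_word is only a name chosen for this sketch; it is not part of NLTK.

```python
def contains_word(words, target):
    """Return True if `target` occurs in `words`, ignoring case."""
    lowered_target = target.lower()  # computed once, outside the loop
    return any(lowered_target == w.lower() for w in words)


# First sentence of Macbeth, tokenized as shown in the question.
sentence = ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by',
            'William', 'Shakespeare', '1603', ']']

print(contains_word(sentence, 'MACBETH'))  # True
print(contains_word(sentence, 'hamlet'))   # False
```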

Given a list of lists lol, the longest sub-list including t can be found by

longest = max((L for L in lol if any(lt==w.lower() for w in L)), key=len)

If multiple sub-lists include t and are of the same maximal length, this will give you the first one, as it happens.
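Note that max raises ValueError when no sub-list contains t; on Python 3.4 or later, passing default=None avoids that. Below is a rough sketch of how the pieces might fit into the loop described in the question. All function names, the whitespace tokenization of the trigger string, the steps limit, and the rule of skipping sentences that were already used (to avoid picking the same sentence forever) are assumptions made for this sketch. "Longest" means the most tokens, as with key=len above. The tiny corpus is made up for the demo; gutenberg.sents() could presumably be passed in its place.

```python
def longest_word(words):
    """Longest purely alphabetic token (first one on ties), or None."""
    return max((w for w in words if w.isalpha()), key=len, default=None)


def longest_sentence_with(sentences, target, exclude=()):
    """Longest tokenized sentence containing `target`, ignoring case.

    Sentences whose tuple form is in `exclude` are skipped.
    Returns None when nothing matches.
    """
    lt = target.lower()  # hoisted out of the loops
    candidates = (s for s in sentences
                  if tuple(s) not in exclude
                  and any(lt == w.lower() for w in s))
    return max(candidates, key=len, default=None)


def generate(trigger, sentences, steps=3):
    """Chain up to `steps` sentences, starting from the trigger string."""
    current = trigger.split()  # naive whitespace tokenization (assumption)
    output = [current]
    used = set()               # sentences already appended (assumption)
    for _ in range(steps):
        word = longest_word(current)
        if word is None:
            break
        found = longest_sentence_with(sentences, word, exclude=used)
        if found is None:
            break
        used.add(tuple(found))
        output.append(found)
        current = found
    return ' '.join(' '.join(s) for s in output)


# Made-up miniature corpus in the same list-of-lists shape.
corpus = [
    ['Actus', 'Primus', '.'],
    ['The', 'Tragedie', 'of', 'Macbeth'],
    ['MACBETH', 'shall', 'never', 'vanquished', 'be'],
    ['Who', 'can', 'be', 'vanquished', '?'],
]

print(generate('tell me about macbeth', corpus, steps=2))
```

Each step scans every sentence once, so on the full corpus it may be worth lowercasing the sentences a single time up front rather than on every search.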


How about using the built-in string method str.lower()? From the documentation: "Return a copy of the string converted to lowercase."

Then just compare the strings.
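Since the question mentions regex: because gutenberg.sents() already yields single tokens, the comparison can be done per token either way. A minimal sketch of both, with helper names made up for the example; re.fullmatch needs Python 3.4 or later, and re.escape is there so that a target containing regex metacharacters is matched literally.

```python
import re


def matches_ignoring_case(word, target):
    """True if `word` equals `target` when letter case is ignored."""
    return word.lower() == target.lower()


def regex_matches_ignoring_case(word, target):
    """Same check with a regex instead of str.lower()."""
    pattern = re.escape(target)
    return re.fullmatch(pattern, word, flags=re.IGNORECASE) is not None


print(matches_ignoring_case('Macbeth', 'MACBETH'))         # True
print(regex_matches_ignoring_case('Macbeth', 'MACBETH'))   # True
print(regex_matches_ignoring_case('Macbeths', 'macbeth'))  # False
```

For plain whole-word equality the str.lower() comparison is the simpler of the two.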
