I am trying to loop through a set of documents and build a list of the words in each one. I am doing it like this; stoplist
is just a list of words that I want to ignore by default.
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
This returns a list of documents, each of which is a list of words. Some of the words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to be working right:
texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
for document in documents]
Or
texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist]
for document in documents]
My other question is this. I may see words like this where I want to keep the word, but dump the trailing numbers / special characters.
agency[15]
assignment[72],
you’ll
america’s
So to clean up most of the other noise, I was thinking I should keep removing characters from the end of a string until the last character is in a-zA-Z, or, if there are more special characters than alphabetic characters in a string, toss it. You can see, though, that in my last two examples the end of the string is already an alphabetic character. So in those cases I should only drop the word if it has more special characters than alphabetic ones. I was thinking I should only search the end of each string because I would like to keep hyphenated words intact if possible.
Basically I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I am not sure how to do that or if it's the best way.
>>> a = ['agency[15]','assignment72,','you’11','america’s']
>>> import re
>>> b = re.compile(r'\w+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment72
you
america
>>> b = re.compile(r'[a-z]+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment
you
america
>>>
Update
>>> a = "I-have-hyphens-yo!"
>>> re.findall('[a-z]+',a)
['have', 'hyphens', 'yo']
>>> re.findall('[a-z-]+',a)
['-have-hyphens-yo']
>>> re.findall('[a-zA-Z-]+',a)
['I-have-hyphens-yo']
>>> re.findall(r'\w+',a)
['I', 'have', 'hyphens', 'yo']
>>>
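To keep hyphenated words intact without picking up a leading or trailing hyphen, one option is to allow a hyphen only between two runs of letters. The following is a minimal sketch rather than a definitive pattern; HYPHEN_WORD_RE and hyphen_words are hypothetical names, and it assumes lowercase ASCII letters are all you want to keep.

```python
import re

# Hypothetical pattern: a run of letters, optionally followed by
# "-letters" groups, so hyphens are only accepted inside a word.
HYPHEN_WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")

def hyphen_words(text):
    """Return lowercase words, keeping internal hyphens only."""
    return HYPHEN_WORD_RE.findall(text.lower())

print(hyphen_words("I-have-hyphens-yo!"))
print(hyphen_words("agency[15] assignment[72], -lead trail-"))
```

Digits, brackets and other punctuation are simply never matched, so agency[15] comes back as agency.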
Maybe try re.findall instead, with a pattern like [a-z]+:
import re
word_re = re.compile(r'[a-z]+')
# Option 1: iterate over match objects with finditer
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist]
for document in documents]
# Option 2: findall returns the matched strings directly
texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
for document in documents]
You can then easily tweak your regular expression to get the words you want. An alternate version uses re.split:
import re
word_re = re.compile(r'[^a-z]+')
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist]
for document in documents]
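If you would rather follow the approach described in the question (strip from the end of each token, then drop tokens that are mostly special characters), something like the sketch below might work. The names TRAILING_JUNK, clean_word and tokenize are hypothetical, and it assumes ASCII letters only, so a trailing accented letter would be stripped too. Note that the stoplist check runs after cleaning; in the rstrip attempts above it runs before, so a token like "the," is compared against the stoplist with its comma still attached.

```python
import re

# Hypothetical helper pattern: any run of non-letters at the end of a token.
TRAILING_JUNK = re.compile(r"[^a-z]+$")

def clean_word(token):
    """Strip trailing non-letters; return None for mostly-noise tokens."""
    word = TRAILING_JUNK.sub("", token.lower())
    letters = sum(ch.isalpha() for ch in word)
    # Toss empty results and tokens with more special chars than letters.
    if letters == 0 or len(word) - letters > letters:
        return None
    return word

def tokenize(document, stoplist):
    """Split on whitespace, clean each token, then apply the stoplist."""
    stop = set(stoplist)
    words = (clean_word(tok) for tok in document.split())
    return [w for w in words if w and w not in stop]

documents = ["The agency[15] sent the assignment[72], you’ll see."]
texts = [tokenize(document, ["the"]) for document in documents]
print(texts)
```

Because only the end of each token is trimmed, hyphenated words and words such as you’ll pass through unchanged.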