How do I use rstrip to remove trailing characters?

开发者 https://www.devze.com 2023-01-20 05:00 Source: web

I am trying to loop through a set of documents and build a list of words for each one. This is how I am doing it; stoplist is just a list of words that I want to ignore by default.

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

I get back a list of documents, each of which is a list of words. Some of the words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to be working right:

texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
         for document in documents]

Or

texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist]
         for document in documents]

My other question is this: I may see words like the ones below, where I want to keep the word but drop the trailing numbers and special characters.

agency[15]
assignment[72],
you’ll
america’s

So to clean up most of the other noise, I was thinking I should keep removing characters from the end of a string until the last character is in a-zA-Z, or, if there are more special characters than alpha characters in a string, toss it. You can see, though, that in my last two examples the end of the string is an alpha character. So in those cases I should just ignore the word because of the number of special characters (more than alpha characters). I was thinking I should only search the end of strings, because I would like to keep hyphenated words intact if possible.

Basically, I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I am not sure how to do that, or whether it's the best way.
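
The heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: clean_word is a hypothetical helper name, "special" is taken to mean anything other than an ASCII letter or a hyphen, and a word is discarded when such characters outnumber its letters.

```python
import re

def clean_word(word):
    """Hypothetical helper: strip trailing non-letters, then reject noisy words."""
    # Drop everything after the last ASCII letter, e.g. "[15]" or "[72],".
    stripped = re.sub(r'[^a-zA-Z]+$', '', word)
    # Count ASCII letters; hyphens are not counted as special characters.
    letters = len(re.findall(r'[a-zA-Z]', stripped))
    specials = len(stripped) - letters - stripped.count('-')
    # Reject empty results and words with more special characters than letters.
    if not stripped or specials > letters:
        return None
    return stripped

for raw in ['agency[15]', 'assignment[72],', 'well-known.', 'a$$$b']:
    print(raw, '->', clean_word(raw))
```

Words for which clean_word returns None would then be filtered out alongside the stoplist check.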


>>> a = ['agency[15]','assignment72,','you’11','america’s']
>>> import re
>>> b = re.compile(r'\w+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment72
you
america
>>> b = re.compile(r'[a-z]+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment
you
america
>>>

Update

>>> a = "I-have-hyphens-yo!"
>>> re.findall('[a-z]+',a)
['have', 'hyphens', 'yo']
>>> re.findall('[a-z-]+',a)
['-have-hyphens-yo']
>>> re.findall('[a-zA-Z-]+',a)
['I-have-hyphens-yo']
>>> re.findall(r'\w+',a)
['I', 'have', 'hyphens', 'yo']
>>>
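
If hyphenated words should stay intact but stray leading or trailing hyphens should go, one possible tweak is to allow hyphens only between runs of letters. The sketch below assumes lowercase ASCII words; tokenize is a hypothetical name used only for illustration.

```python
import re

# One or more letters, optionally followed by "-letters" groups,
# so a hyphen is only matched when it sits between two letter runs.
hyphen_word_re = re.compile(r'[a-z]+(?:-[a-z]+)*')

def tokenize(text):
    """Hypothetical helper: lowercase the text and return its words."""
    return hyphen_word_re.findall(text.lower())

print(tokenize("-have-hyphens-yo! state-of-the-art, agency[15]"))
# ['have-hyphens-yo', 'state-of-the-art', 'agency']
```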


Maybe try re.findall instead, with a pattern like [a-z]+:

import re
word_re = re.compile(r'[a-z]+')

# Version 1: walk the match objects with finditer.
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist]
          for document in documents]

# Version 2: same result with findall, which returns the matched strings directly.
texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
          for document in documents]

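You can then easily tweak the regular expression to get the words you want. An alternate version uses re.split, splitting on runs of non-letters and skipping the empty strings that split can produce at the edges: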

import re
word_re = re.compile(r'[^a-z]+')
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist]
          for document in documents]
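
As for rstrip itself: called with no argument it strips only trailing whitespace, which split() has already removed, so word.rstrip() changes nothing. Passing a set of characters does strip them, but in the original attempt the stoplist test runs on the unstripped word. A minimal sketch that strips first and filters second, assuming ASCII punctuation from string.punctuation is the set to remove (strip_and_filter and the sample data are hypothetical):

```python
import string

def strip_and_filter(documents, stoplist):
    """Hypothetical helper: strip trailing punctuation, then apply the stoplist."""
    return [[word
             # Strip before filtering so "the," is compared as "the".
             for word in (raw.rstrip(string.punctuation)
                          for raw in document.lower().split())
             # Skip tokens that were nothing but punctuation.
             if word and word not in stoplist]
            for document in documents]

stoplist = {'the', 'a', 'of'}  # sample data, not from the question
documents = ["The agency, of course!", "A test: done..."]
print(strip_and_filter(documents, stoplist))
# [['agency', 'course'], ['test', 'done']]
```

Note that rstrip stops at the first character not in the given set, so a token like agency[15] would lose only its closing bracket here; the regular-expression versions above handle that case.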