I'm working on an Information Retrieval task. As part of pre-processing I want to do:
- Stopword removal
- Tokenization
- Stemming (Porter Stemmer)
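For context, the full pipeline I'm aiming for looks roughly like this (a sketch using NLTK's stopword list and Porter stemmer, assuming the relevant NLTK data is downloaded; the Tokenize step is what this question is about):

import nltk
from nltk.corpus import stopwords          # needs the 'stopwords' corpus downloaded
from nltk.stem.porter import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Tokenize is defined below; lowercase, drop stopwords, then stem.
    tokens = [t.lower() for t in Tokenize(text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]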
Initially, I skipped tokenization. As a result I got terms like this:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.
So now I realize the importance of tokenization. Is there any standard tokenization algorithm for English, based on string.whitespace and commonly used punctuation marks? I wrote:
def Tokenize(text):
    words = text.split(['.', ',', '?', '!', ':', ';', '-', '_', '(', ')', '[', ']', '\'', '`', '"', '/', ' ', '\t', '\n', '\x0b', '\x0c', '\r'])
    return [word.strip() for word in words if word.strip() != '']
but I'm getting this error:
TypeError: coercing to Unicode: need string or buffer, list found
How can this tokenization routine be improved?
There is no single perfect algorithm for tokenization, though your approach may suffice for information retrieval purposes. The TypeError occurs because str.split takes a single separator string, not a list of separators. It is easier to implement using a regular expression:
import re

def Tokenize(text):
    words = re.split(r'[-\.,?!:;_()\[\]\'`"/\t\n\r \x0b\x0c]+', text)
    return [word.strip() for word in words if word.strip() != '']
It can be improved in various ways, such as handling abbreviations properly:
>>> Tokenize('U.S.')
['U', 'S']
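One way to deal with this (a sketch of my own, not part of the routine above) is to match dotted abbreviations explicitly before falling back to plain word characters:

import re

def tokenize_abbrev(text):
    # Hypothetical variant: runs like 'U.S.' are matched first, so they
    # survive as single tokens; everything else falls back to \w+.
    return re.findall(r'(?:[A-Za-z]\.){2,}|\w+', text)

>>> tokenize_abbrev('The U.S. broker')
['The', 'U.S.', 'broker']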
And watch out what you do with the dash (-). Consider:
>>> Tokenize('A-level')
['A', 'level']
If 'A' or 'a' occurs in your stop list, this will be reduced to just 'level'.
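If you would rather keep hyphenated words intact, one option (again my own sketch, not part of the routine above) is to match words with optional internal hyphens instead of splitting on them:

import re

def tokenize_keep_hyphens(text):
    # Hypothetical variant: a token is \w+ optionally followed by
    # -\w+ groups, so 'A-level' and 'broker-dealer' stay whole.
    return re.findall(r'\w+(?:-\w+)*', text)

>>> tokenize_keep_hyphens('A-level broker-dealers, brokered.')
['A-level', 'broker-dealers', 'brokered']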
I suggest you check out Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
As larsman mentions, NLTK has a variety of different tokenizers that accept various options. Using wordpunct_tokenize:
>>> import nltk
>>> words = nltk.wordpunct_tokenize('''
... broker
... broker'
... broker,
... broker.
... broker/deal
... broker/dealer'
... broker/dealer,
... broker/dealer.
... broker/dealer;
... broker/dealers),
... broker/dealers,
... broker/dealers.
... brokerag
... brokerage,
... broker-deal
... broker-dealer,
... broker-dealers,
... broker-dealers.
... brokered.
... brokers,
... brokers.
... ''')
>>> words
['broker', 'broker', "'", 'broker', ',', 'broker', '.', 'broker', '/', 'deal', 'broker', '/', 'dealer', "'", 'broker', '/', 'dealer', ',', 'broker', '/', 'dealer', '.', 'broker', '/', 'dealer', ';', 'broker', '/', 'dealers', '),', 'broker', '/', 'dealers', ',', 'broker', '/', 'dealers', '.', 'brokerag', 'brokerage', ',', 'broker', '-', 'deal', 'broker', '-', 'dealer', ',', 'broker', '-', 'dealers', ',', 'broker', '-', 'dealers', '.', 'brokered', '.', 'brokers', ',', 'brokers', '.']
If you want to filter out list items that are punctuation only, you could do something like this:
>>> filter_chars = "',.;()-/"
>>> def is_only_punctuation(s):
...     '''
...     returns True if set(s) is not a subset of set(filter_chars),
...     i.e. if s contains anything besides punctuation
...     '''
...     return not set(s) <= set(filter_chars)
...
>>> filter(is_only_punctuation, words)
which returns:
['broker', 'broker', 'broker', 'broker', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'broker', 'dealers', 'brokerag', 'brokerage', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'brokered', 'brokers', 'brokers']
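To tie this back to the original pipeline, you could then run the filtered tokens through the Porter stemmer (a sketch assuming NLTK is installed; 'filtered' is just an illustrative name):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
filtered = filter(is_only_punctuation, words)
# e.g. 'brokered' -> 'broker', 'dealers' -> 'dealer', 'brokerage' -> 'brokerag'
stems = [stemmer.stem(w) for w in filtered]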