What is a proper tokenization algorithm? & Error: TypeError: coercing to Unicode: need string or buffer, list found


I'm doing an Information Retrieval task. As part of pre-processing I want to do:

  1. Stopword removal
  2. Tokenization
  3. Stemming (Porter Stemmer)

Initially, I skipped tokenization. As a result, I got terms like these:

broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.

So now I realize the importance of tokenization. Is there any standard tokenization algorithm for English text? Based on string.whitespace and commonly used punctuation marks, I wrote:

def Tokenize(text):
    words = text.split(['.',',', '?', '!', ':', ';', '-','_', '(', ')', '[', ']', '\'', '`', '"', '/',' ','\t','\n','\x0b','\x0c','\r'])    
    return [word.strip() for word in words if word.strip() != '']
  1. Why am I getting the TypeError: coercing to Unicode: need string or buffer, list found error?
  2. How can this tokenization routine be improved?


There is no single perfect algorithm for tokenization, though your algorithm may suffice for information retrieval purposes. As for the error: str.split() takes a single separator string, not a list of separators, which is why passing a list raises that TypeError. The routine is easier to implement using a regular expression:

import re

def Tokenize(text):
    # split on runs of punctuation and whitespace characters
    words = re.split(r'[-\.,?!:;_()\[\]\'`"/\t\n\r \x0b\x0c]+', text)
    return [word.strip() for word in words if word.strip() != '']
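
For instance, applied to a few of the terms from your list, this version should give clean tokens:

>>> Tokenize('broker/dealers), broker-dealer, brokers.')
['broker', 'dealers', 'broker', 'dealer', 'brokers']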

It can be improved in various ways, such as handling abbreviations properly:

>>> Tokenize('U.S.')
['U', 'S']

And watch out for what you do with the dash (-). Consider:

>>> Tokenize('A-level')
['A', 'level']

If 'A' or 'a' occurs in your stop list, this will be reduced to just 'level'.
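
If you would rather keep abbreviations and hyphen- or slash-joined compounds together, one option is to match tokens rather than split on separators. A minimal sketch, using a hypothetical tokenize_keep_compounds whose pattern you would want to tune for your own collection:

import re

def tokenize_keep_compounds(text):
    # keep dotted abbreviations (U.S.) and hyphen/slash compounds (broker-dealer) as single tokens
    return re.findall(r"[A-Za-z]\.(?:[A-Za-z]\.)+|\w+(?:[-/]\w+)*", text)

>>> tokenize_keep_compounds('U.S. broker-dealers, brokers.')
['U.S.', 'broker-dealers', 'brokers']

Whether that is the right behaviour depends on how you want compounds like broker/dealer to be indexed.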

I suggest you check out Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
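
For reference, trying NLTK's word_tokenize only takes a couple of lines; note that it relies on the punkt tokenizer models, which are a one-time download:

import nltk

nltk.download('punkt')            # one-time download of the tokenizer models
tokens = nltk.word_tokenize('U.S. broker-dealers, brokers.')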


As larsman mentions, NLTK has a variety of different tokenizers that accept various options. Using wordpunct_tokenize:

>>> import nltk
>>> words = nltk.wordpunct_tokenize('''
... broker
... broker'
... broker,
... broker.
... broker/deal
... broker/dealer'
... broker/dealer,
... broker/dealer.
... broker/dealer;
... broker/dealers),
... broker/dealers,
... broker/dealers.
... brokerag
... brokerage,
... broker-deal
... broker-dealer,
... broker-dealers,
... broker-dealers.
... brokered.
... brokers,
... brokers.
... ''')
['broker', 'broker', "'", 'broker', ',', 'broker', '.', 'broker', '/', 'deal', 'broker', '/', 'dealer', "'", 'broker', '/', 'dealer', ',', 'broker', '/', 'dealer', '.', 'broker', '/', 'dealer', ';', 'broker', '/', 'dealers', '),', 'broker', '/', 'dealers', ',', 'broker', '/', 'dealers', '.', 'brokerag', 'brokerage', ',', 'broker', '-', 'deal', 'broker', '-', 'dealer', ',', 'broker', '-', 'dealers', ',', 'broker', '-', 'dealers', '.', 'brokered', '.', 'brokers', ',', 'brokers', '.']

If you want to filter out list items that are punctuation only, you could do something like this:

>>> filter_chars = "',.;()-/"
>>> def is_not_just_punctuation(s):
...     '''
...     True if s contains at least one character outside filter_chars
...     '''
...     return not set(s) <= set(filter_chars)
...
>>> filter(is_not_just_punctuation, words)

returns

['broker', 'broker', 'broker', 'broker', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'broker', 'dealers', 'brokerag', 'brokerage', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'brokered', 'brokers', 'brokers']
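
From there, the remaining steps in your pipeline (stopword removal and Porter stemming) can be chained onto the filtered tokens. A rough sketch using NLTK's stopwords corpus and PorterStemmer (the stopword list needs a one-time nltk.download('stopwords'), and lowercasing before the stop check is just one reasonable choice):

>>> from nltk.corpus import stopwords
>>> from nltk.stem import PorterStemmer
>>> stop = set(stopwords.words('english'))
>>> stemmer = PorterStemmer()
>>> tokens = filter(is_not_just_punctuation, words)
>>> [stemmer.stem(w.lower()) for w in tokens if w.lower() not in stop]    # e.g. 'brokers' -> 'broker'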