I'm looking at NLTK for Python, but it splits (tokenizes) won't as ['wo', "n't"]. Are there libraries that do this more robustly?
I know I can build a regex of some sort to solve this problem, but I'm looking for a library/tool because it would be a more direct approach. For example, after a basic regex with periods and commas, I realized words like 'Mr.' will break the system.
(@artsiom)
If the sentence was "you won't?", split() will give me ["you", "won't?"], so there's an extra '?' that I have to deal with. I'm looking for a tried and tested method that does away with kinks like the one above, as well as the many other exceptions that I'm sure exist. Of course, I'll resort to a split(regex) if I don't find anything.
The Natural Language Toolkit (NLTK) is probably what you need.
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test. It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']
nltk.tokenize.word_tokenize by default uses the TreebankWordTokenizer, a word tokenizer that tokenizes sentences following the Penn Treebank conventions.
Note that this tokenizer assumes that the text has already been segmented into sentences.
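Since word_tokenize expects sentence-segmented input, here is a minimal sketch of pairing it with sent_tokenize on longer text (this assumes you've run nltk.download('punkt') for the sentence models; the sample text and the output shown are what the Treebank conventions should produce):
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "Mr. Smith won't come. He is busy!"
>>> [word_tokenize(s) for s in sent_tokenize(text)]
[['Mr.', 'Smith', 'wo', "n't", 'come', '.'], ['He', 'is', 'busy', '!']]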
You can test some of the various tokenizers provided by NLTK (e.g. WordPunctTokenizer, WhitespaceTokenizer, ...) on this page.
Despite what you say, NLTK is by far your best bet. You will not find a more 'tried and tested' method than the tokenizers in there (since some are based on classifiers trained especially for this). You just need to pick the right tokenizer for your needs. Let's take the following sentence:
I am a happy teapot that won't do stuff?
Here is how the various tokenizers in NLTK will split it up.
TreebankWordTokenizer
I am a happy teapot that wo n't do stuff ?
WordPunctTokenizer
I am a happy teapot that won ' t do stuff ?
PunktWordTokenizer
I am a happy teapot that won 't do stuff ?
WhitespaceTokenizer
I am a happy teapot that won't do stuff?
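If you want to reproduce this comparison yourself, here is a minimal sketch (note that PunktWordTokenizer is not available in recent NLTK releases, so only the other three appear):
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer

sentence = "I am a happy teapot that won't do stuff?"
for tokenizer in (TreebankWordTokenizer(), WordPunctTokenizer(), WhitespaceTokenizer()):
    # each tokenizer exposes the same .tokenize() interface
    print(tokenizer.__class__.__name__, tokenizer.tokenize(sentence))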
Your best bet might be a combination of approaches. For example, you might use the PunktSentenceTokenizer to tokenize your sentences first; this tends to be extremely accurate. Then, for each sentence, remove the punctuation characters at the end, if any, and apply the WhitespaceTokenizer. That way you'll avoid the final punctuation/word combination (e.g. stuff?), since you will have removed the final punctuation characters from each sentence, but you'll still know where the sentences are delimited (e.g. store them in an array), and you won't have words such as won't broken up in unexpected ways.
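A minimal sketch of that combination, using sent_tokenize (NLTK's Punkt-based sentence splitter, which needs the punkt models downloaded); the sample text and the rstrip('.?!') punctuation set are illustrative assumptions:
from nltk.tokenize import sent_tokenize, WhitespaceTokenizer

text = "I am a happy teapot that won't do stuff? Don't ask why."
ws = WhitespaceTokenizer()

tokenized = []
for sentence in sent_tokenize(text):         # Punkt-based sentence splitting
    sentence = sentence.rstrip('.?!')        # drop only the final punctuation
    tokenized.append(ws.tokenize(sentence))  # plain whitespace word splitting

print(tokenized)
# expected: [['I', 'am', 'a', 'happy', 'teapot', 'that', "won't", 'do', 'stuff'], ["Don't", 'ask', 'why']]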
@Karthick, here is a simple algorithm I used long ago to split a text into a wordlist:
- Input text
- Iterate through the text character by character.
- If the current character is in "alphabet", append it to the current word. Otherwise, add the word built so far to the list and start a new word.
alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'")  # include the apostrophe so contractions like won't stay whole
text = "I won't answer this question!"
word = ''
wordlist = []
for c in text:
    if c in alphabet:
        word += c            # still inside a word
    else:
        if word:             # a word just ended: store it
            wordlist.append(word)
        word = ''
if word:                     # flush a trailing word if the text doesn't end in punctuation
    wordlist.append(word)
print(wordlist)
['I', "won't", 'answer', 'this', 'question']
It's just a launchpad and you can definitely modify this algorithm to make it smarter :)
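For instance, here is a minimal sketch of one such refinement, letting a regex define what counts as a word (the pattern is just an illustrative assumption, not part of the original answer):
import re

# letters, optionally followed by one apostrophe-suffix such as 't or 'll
print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", "I won't answer this question!"))
# ['I', "won't", 'answer', 'this', 'question']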
NLTK comes with a number of different tokenizers, and you can see demos for each online at the text-processing.com word tokenization demo. For your case, it looks like the WhitespaceTokenizer is best, which is essentially the same as doing string.split().
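A quick check of that equivalence, using the sample string from the question:
>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize("you won't?")
['you', "won't?"]
>>> "you won't?".split()
['you', "won't?"]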
You can try this:
op = []
string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"

# Repeatedly slice off everything up to the next space until none are left.
while True:
    if ' ' in string_big:
        space_found = string_big.index(' ')
        op.append(string_big[:space_found])        # the word before the space
        string_big = string_big[space_found + 1:]  # drop the word and the space
    else:
        op.append(string_big)                      # last word: no trailing space
        break

print(op)