I'm running into an odd assertion error when using NLTK to process around 5000 posts with the PlaintextCorpusReader. With some of our datasets we don't have any major issues. However, on rare occasions I'm met with:
File "/home/cp-staging/environs/cpstaging/lib/python2.5/site-packages/nltk/tag/api.py", line 51, in batch_tag
return [self.tag(sent) for sent in sentences]
File "nltk/corpus/reader/util.py", line 401, in iterate_from
File "nltk/corpus/reader/util.py", line 343, in iterate_from
AssertionError
My code works (basically) like so:
import nltk
from nltk.corpus import brown
from nltk.corpus.reader import PlaintextCorpusReader
brown_tagged_sents = brown.tagged_sents()
tag0 = ArcBaseTagger('NN')
tag1 = nltk.UnigramTagger(brown_tagged_sents, backoff=tag0)
posts = PlaintextCorpusReader(posts_path, '.*')
tagger = nltk.BigramTagger(brown_tagged_sents, backoff=tag1)
tagged_sents = tagger.batch_tag(posts.sents())
It seems like nltk is losing its place in the file buffer, but I'm not 100% on that. Any idea what might cause this to happen? It almost seems like it has to have something to do with the data I'm processing. Maybe some funky characters?
I also hit this problem when a write function was leaving my corpus files empty. Making sure the files being read are not empty avoids this error.
Removed some empty files from the parsing, problem solved.
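For anyone who wants to skip the empty files instead of deleting them, here is a rough sketch of one way to do it. The helper name non_empty_fileids is made up for this example and is not part of NLTK; it only checks file size in bytes, so it assumes that zero-byte files are the ones causing trouble.

```python
import os


def non_empty_fileids(root):
    """Return paths (relative to root) of files that contain at least one byte.

    Hypothetical helper, not an NLTK function. Paths use forward slashes,
    which is the form corpus readers use for file ids.
    """
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Skip zero-byte files; only the size is checked, not the content.
            if os.path.getsize(full_path) > 0:
                rel_path = os.path.relpath(full_path, root)
                fileids.append(rel_path.replace(os.sep, "/"))
    return sorted(fileids)


# Possible usage with the variables from the question (untested assumption):
# posts = PlaintextCorpusReader(posts_path, non_empty_fileids(posts_path))
```

PlaintextCorpusReader accepts either a regular expression or an explicit list of file ids as its second argument, so the list can replace the '.*' pattern from the question.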