I am currently trying to build a general purpose (or as general as is practical) POS tagger with NLTK. I have dabbled with the brown and treebank corpora for training, but will probably be settling on the treebank corpus.
Learning as I go, I am finding the classifier POS taggers are the most accurate. The Maximum Entity classifier is meant to be the most accurate, but I find it uses so much memory (and processing time) that I have to significantly reduce the training dataset, so the end result is less accurate than using the default Naive Bayes classifier.
It has been suggested that I use MEGAM. NLTK has some support for MEGAM, but all the examples I have found are for general classifiers (eg. a text classifier that uses a vector of word features), rather than a more specific POS tagger. Without having to recreate my own POS feature extractor and compiler (ie. I prefer to use the one already in NLTK), how can I used the MEGAM MaxEnt classifier? Ie. how can I drop it in some existing MaxEnt code that is along the lines of:
maxent_tagger = ClassifierBasedPOSTagger(train=train开发者_运维百科ing_sentences,
classifier_builder=MaxentClassifier.train )
This one liner should work for training a MEGAM MaxentClassifier for the ClassifierBasedPOSTagger. Of course, that assumes MEGAM is already installed (go here to download)
maxent_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, algorithm='megam', max_iter=10, min_lldelta=0.1))
For the future users:
Megam is now available on MAC:
$brew tap homebrew/science
$brew install megam
If you dont have XQuartz, it might ask you to get that first. Here is the direct download link: http://xquartz.macosforge.org/downloads/SL/XQuartz-2.7.5_rc4.dmg
精彩评论