
How to incrementally train an nltk classifier

Developer (https://www.devze.com), 2023-02-09 05:10, source: the web

I am working on a project to classify snippets of text using the Python NLTK module and the Naive Bayes classifier. I am able to train on corpus data and classify another set of data, but would like to feed additional training information into the classifier after initial training.

If I'm not mistaken, there doesn't appear to be a way to do this, in that the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset?

I'm open to suggestions including other classifiers that can accept new training data over time.


There are two options that I know of:

1) Periodically retrain the classifier on the full data set, including the new data. You'd accumulate new training data in a corpus (that already contains the original training data), then every few hours, retrain and reload the classifier. This is probably the simplest solution.

2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
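To make the second option a bit more concrete, here is a minimal plain-Python sketch of the idea: keep the label counts and per-label feature counts outside the classifier, and fold new examples into them as they arrive. It does not call NLTK at all. The class name IncrementalNaiveBayes, the helper word_features, and the smoothing constant alpha=0.5 are all made up for illustration, and the way missing features are counted is my assumption about a reasonable approach, not a copy of what the train method does.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical helper: naive Bayes over dict featuresets with updatable counts."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                          # additive smoothing (assumed value)
        self.label_counts = Counter()               # label -> number of examples seen
        self.feature_counts = defaultdict(Counter)  # (label, fname) -> value counts
        self.feature_values = defaultdict(set)      # fname -> values seen for any label

    def update(self, labeled_featuresets):
        """Fold new (featureset, label) pairs into the stored counts."""
        for featureset, label in labeled_featuresets:
            self.label_counts[label] += 1
            for fname, fval in featureset.items():
                self.feature_counts[label, fname][fval] += 1
                self.feature_values[fname].add(fval)

    def _log_prob(self, label, featureset):
        total = sum(self.label_counts.values())
        n_label = self.label_counts[label]
        n_labels = len(self.label_counts)
        # Smoothed log prior for the label.
        logp = math.log((n_label + self.alpha) / (total + self.alpha * n_labels))
        for fname, fval in featureset.items():
            if fname not in self.feature_values:
                continue  # feature never seen in training: ignore it
            counts = self.feature_counts.get((label, fname), Counter())
            # One extra bin stands for "feature absent from the example".
            bins = len(self.feature_values[fname]) + 1
            logp += math.log((counts[fval] + self.alpha) / (n_label + self.alpha * bins))
        return logp

    def classify(self, featureset):
        """Return the label with the highest smoothed log probability."""
        if not self.label_counts:
            raise ValueError("no training data yet")
        return max(self.label_counts, key=lambda lab: self._log_prob(lab, featureset))


def word_features(text):
    """Hypothetical feature extractor: one boolean feature per word."""
    return {word: True for word in text.lower().split()}


clf = IncrementalNaiveBayes()
clf.update([(word_features("training test totally tubular"), "t")])
# Later, new training data arrives and is added without the original set.
clf.update([(word_features("super speeding special sport"), "s")])

print(clf.classify(word_features("tubular test")))   # t
print(clf.classify(word_features("super special")))  # s
```

Because probabilities are computed from the counts at classification time, new data takes effect immediately. With NLTK itself, the same counts would instead be turned into the label_probdist and feature_probdist that the NaiveBayesClassifier constructor takes, and those would need rebuilding after each update.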


I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.

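There is an update() method on the NaiveBayesClassifier instance, which appears to add to the training data. Note that the example below imports the classifier from textblob.classifiers, i.e. TextBlob's wrapper around NLTK, rather than from nltk itself: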

from textblob.classifiers import NaiveBayesClassifier

train = [
    ('training test totally tubular', 't'),
]

cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])

print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))

This prints out:

t t
s s


As Jacob said, the second method is the right way. Hopefully someone will write the code for it.

See:

https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
