As part of my academic research project, I am trying to build an application that works with a set of URLs retrieved from the web. The task is to classify each of these URLs into a category.
For instance, the following URL is about cricket: http://www.espncricinfo.com/icc_cricket_worldcup2011/content/current/story/499851.html If I give this URL to the classifier, it should output the category "Sports".
For this I am using the LingPipe classifier. I followed the classification tutorial and ran the demo in the demo folder. I downloaded the 20 Newsgroups data set from the following link: http://people.csail.mit.edu/people/jrennie/20Newsgroups
Later, I reduced the training sample size from 20 to 8 and ran the classification demo. It trained on the data and tested it successfully.
My question is: do I need to train the classifier every time I want to find the category of a document? Running the classification takes 4 minutes for training and testing combined.
Can I store the trained data once and perform the classification several times?
You need to serialize the trained model to disk; afterwards you can deserialize it and have the classifier ready to go without retraining.
Once you have a trained classifier, use
AbstractExternalizable.compileTo(classifier,modelFile);
to write the model to disk.
To read it back in, you will need the following call, which returns an Object that you cast to your classifier type:
AbstractExternalizable.readObject(modelFile);
See the Javadoc for AbstractExternalizable for details.
The model will not be able to accept additional training events because it has been compiled.
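To illustrate the train-once, load-many-times pattern in isolation, here is a minimal sketch that uses only plain Java serialization. It does not call LingPipe: ToyModel, save, load, and the keyword lookup are hypothetical stand-ins invented for this example, and a real LingPipe model would be written and read with the AbstractExternalizable calls shown above.

```java
import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class ModelStoreSketch {

    /** Hypothetical stand-in for a trained classifier; NOT a LingPipe class. */
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, String> keywordToCategory = new HashMap<>();

        /** "Training" here just records a keyword for a category. */
        void train(String keyword, String category) {
            keywordToCategory.put(keyword.toLowerCase(), category);
        }

        /** Returns the category of the first known keyword found in the text. */
        String classify(String text) {
            String lower = text.toLowerCase();
            for (Map.Entry<String, String> entry : keywordToCategory.entrySet()) {
                if (lower.contains(entry.getKey())) {
                    return entry.getValue();
                }
            }
            return "Unknown";
        }
    }

    /** Writes the trained model to disk once, after training. */
    static void save(ToyModel model, File file) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(file.toPath()))) {
            out.writeObject(model);
        }
    }

    /** Reads a previously saved model back; no retraining is needed. */
    static ToyModel load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(file.toPath()))) {
            return (ToyModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File modelFile = File.createTempFile("toy-model", ".bin");
        modelFile.deleteOnExit();

        // Slow step, done once: train and store.
        ToyModel trained = new ToyModel();
        trained.train("cricket", "Sports");
        save(trained, modelFile);

        // Fast step, done as often as needed: load and classify.
        ToyModel restored = load(modelFile);
        System.out.println(restored.classify(
            "http://www.espncricinfo.com/icc_cricket_worldcup2011/content/current/story/499851.html"));
    }
}
```

In a real application the training half and the loading half would normally live in separate programs, so that only the first one pays the 4-minute training cost.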