I have a ton of short stories, each about 500 words long, and I want to categorize them into one of, let's say, 20 categories:
- Entertainment
- Food
- Music
- etc
I can hand-classify a bunch of them, but I want to implement machine learning to guess the categories eventually. What's the best way to approach this? Is there a standard approach to machine learning I should be using? I don't think a decision tree would work well since it's text data... I'm completely new to this field.
Any help would be appreciated, thanks!
A naive Bayes classifier will most probably work for you. The method is like this:
- Fix a number of categories and get a training data set of (document, category) pairs.
- A data vector for a document will be something like a bag of words. For example, take the 100 most common words, excluding stop words like "the" and "and". Each word gets a fixed position in your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word appears in the corresponding document.
Training:
- For your training set, estimate the probability of every class and of every feature within a class. Class prior: p(C) = (number of documents of class C) / (total number of documents).
- Feature likelihood: p(F|C) = (number of documents of class C with the given feature, e.g. the word "food" appears in the text) / (number of documents of class C).
Decision:
- Given an unclassified document with features F1, ..., F100, the probability of it belonging to class C satisfies P(C|F1, ..., F100) ∝ P(C) * P(F1|C) * P(F2|C) * ... * P(F100|C). Pick the C that maximizes this term.
- Since multiplying many small probabilities causes numerical underflow, you can maximize the sum of the logs instead, which is maximized by the same C: log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(F100|C).
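For illustration, here is a minimal sketch of this boolean bag-of-words naive Bayes, assuming you are willing to use scikit-learn rather than coding the counts by hand; `stories` and `labels` are placeholders for your hand-classified data.

```python
# A minimal sketch of the naive Bayes approach described above, using
# scikit-learn (an assumed substitution for doing the counting yourself).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Placeholder training data: replace with your hand-classified stories.
stories = ["A review of the new jazz album ...", "A recipe for tomato soup ..."]
labels  = ["Music", "Food"]

# Boolean bag-of-words over the 100 most common words, minus stop words.
vectorizer = CountVectorizer(binary=True, max_features=100, stop_words="english")
X = vectorizer.fit_transform(stories)

# BernoulliNB works with log probabilities internally, so the
# underflow issue described above is already handled.
clf = BernoulliNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["Where to eat tomato soup downtown ..."])))
```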
I've classified tens of thousands of short texts. What I did initially was use a tf-idf vector space model and then do k-means clustering on those vectors. This is a very good initial step of exploratory data analysis to get a nice handle on your dataset. The package I used for clustering was CLUTO: http://glaros.dtc.umn.edu/gkhome/views/cluto/
To do tf-idf, I just wrote a quick script in Perl to tokenize on non-alphanumerics. Every document is then represented as a bag of words, i.e. a vector over the words it contains. The value at each index of the vector is the term frequency (tf) times the inverse document frequency (idf): the count of that word/term in the document multiplied by the reciprocal of the fraction of documents that contain that word (often log-scaled), because a word like "the" is very uninformative.
This method will quickly get you about 80%-90% accuracy. You can then manually label the ones that are right (or more importantly: wrong) and then do supervised learning if you so choose.
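If you would rather not use Perl and CLUTO, here is a rough sketch of the same tf-idf + k-means step using scikit-learn (an assumed substitution); `stories` is a placeholder for your list of texts.

```python
# A rough equivalent of the tf-idf + k-means step above, sketched with
# scikit-learn instead of Perl + CLUTO (an assumed substitution).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder documents: replace with your ~500-word stories.
stories = ["a short story about a jazz band",
           "a recipe with tomato and basil",
           "a review of the new arcade game"]

# Tokenize on alphanumeric runs and weight terms by tf * idf.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+")
X = vectorizer.fit_transform(stories)

# Cluster into roughly as many groups as you expect categories
# (capped here so the toy example still runs).
kmeans = KMeans(n_clusters=min(20, len(stories)), n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Skim a few documents per cluster to get a handle on the dataset.
for doc, cid in zip(stories, cluster_ids):
    print(cid, doc[:60])
```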
I think the paper "Machine learning in automated text categorization" (you can Google it and download the PDF) is worth reading. The paper discusses two crucial parts: one on feature selection (translating text into a feature space), the other on building a classifier over that feature space.
There are a lot of feature selection methods and several classification methods (decision tree, naive Bayes, kNN, SVM, etc.). You can try a few combinations to see what works on your data set.
I did something similar before: I used Python for text manipulation, feature selection, and feature weighting, and Orange for the classifier. Orange and Weka already include naive Bayes, kNN, and so on, but nowadays I might just write the classifier in Python directly; it shouldn't be very hard either.
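For what it's worth, here is a minimal sketch of trying a few classifier combinations on the same features, written against scikit-learn rather than Orange or Weka (an assumed substitution), with placeholder data.

```python
# Compare a few classifiers on identical tf-idf features via cross-validation.
# scikit-learn is an assumed substitution for Orange/Weka; data is placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stories = ["tomato soup and fresh basil recipe",
           "live jazz concert with a great guitar solo"] * 20
labels  = ["Food", "Music"] * 20

for name, clf in [("naive Bayes", MultinomialNB()),
                  ("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("linear SVM", LinearSVC())]:
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipeline, stories, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```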
Hope this helps.
Most people will say that statistical text analysis (like a naive Bayes approach) is the standard approach: "Foundations of Statistical Natural Language Processing" by Manning and Schuetze and "Speech and Language Processing" by Jurafsky and Martin are the standard references. Statistical text analysis became the standard approach during the late 90s because it easily outperformed symbolic systems. However, some symbolic systems incorporate statistical elements, and you can also use a connectionist approach (there are a few papers demonstrating this). You can also use cosine similarity (a form of k-nearest neighbors), although naive Bayes is usually the top performer.
Here's a good overview: http://www.cs.utexas.edu/users/hyukcho/classificationAlgorithm.html I used Rainbow, mentioned on that page, for text classification on a search engine prototype I wrote for a dot-com project.
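As a small illustration of the cosine-similarity (nearest-neighbor) idea, here is a sketch assuming scikit-learn; the stories and labels are placeholders.

```python
# Classify a new text by the label of its most cosine-similar labelled story.
# scikit-learn is an assumption; `stories` and `labels` are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stories = ["tomato soup recipe", "jazz guitar solo", "new action movie review"]
labels  = ["Food", "Music", "Entertainment"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(stories)

def classify(text):
    """Return the label of the most cosine-similar labelled story."""
    sims = cosine_similarity(vectorizer.transform([text]), X)[0]
    return labels[sims.argmax()]

print(classify("a guitar lesson for beginners"))
```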
Unless there is a chance that you want to do another 500 classifications in the future, I am not sure I would go for a machine learning approach.
Unless the categories are very similar ("food" and "Italian food", to take an example), I think quite a naive heuristic could work very well.
For each category, build a table of common words (for food: "potato", "food", "cook", "tomato", "restaurant", ...) and for each text count which category got the most word matches. Instead of building the dictionary by hand, you could take a sample (say 100) of the texts, categorize them by hand, let an algorithm pick out the words, and then remove words that are common across all sets (since they provide no information). This is, in essence, a very simple "learning" system.
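Here is a hypothetical sketch of that keyword-table heuristic; the word tables below are illustrative only and would come from your hand-labelled sample in practice.

```python
# A keyword-table heuristic: pick the category whose word table has the
# most matches in the text. The tables here are illustrative placeholders.
import re
from collections import Counter

CATEGORY_WORDS = {
    "Food":          {"potato", "food", "cook", "tomato", "restaurant"},
    "Music":         {"song", "album", "guitar", "band", "concert"},
    "Entertainment": {"movie", "show", "actor", "theater", "game"},
}

def classify(text):
    """Return the category with the most word-table matches in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter({cat: len(words & table)
                      for cat, table in CATEGORY_WORDS.items()})
    return scores.most_common(1)[0][0]

print(classify("The cook put the tomato soup on the restaurant menu."))
```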
If you really want a machine learning system, there are a number of methods for classification. The downside is that although most methods are quite simple to implement, the hard part is to choose a good method, the right features, and good parameters.
Try Weka... it's a free data mining tool that implements a lot of machine learning algorithms. It has a GUI and an API, so you can use it directly on your data set or you can program against it.
If you like the results from the various machine learning algorithms and you're still interested in implementing your own algorithms, then you can implement the one(s) that you like the most. This will also help you remove some of the "will it actually work" feeling that you normally get before you build an ML/AI algorithm.
We can use NLP here. These are the steps I followed to classify emails into different categories:
1. Lemmatization: this removes unnecessary detail and converts all words into their basic or root forms, e.g. working becomes work, running becomes run, horses becomes horse. We can use the Stanford Lemmatizer for this purpose. http://stanfordnlp.github.io/CoreNLP/
2. WordNet filtering: keep only those words which are present in WordNet. I used the Java WordNet Interface for this purpose; just filter out the words that are not found in WordNet and keep the rest. http://projects.csail.mit.edu/jwi/
3. Find synonyms and further synonyms: for each of the categories mentioned above, form a separate set containing synonyms of the category and then synonyms of those synonyms. For example, form a set containing synonyms of Entertainment, and then further synonyms of the synonyms found. These sets can be grown using web crawling as well.
4. Feed the data: take all the words of a particular story after lemmatization and WordNet filtering and count how many of them match each category set. For example, if a story contains 100 words and 35 of them match the Entertainment set, 40 match Food, and 30 match Travel, then it is highly likely to fall under Food, and hence it would be a Food story. I got good results for my email classification using this approach.
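Here is a rough sketch of that pipeline using NLTK's WordNet interface instead of CoreNLP and JWI (an assumed substitution); the category list and the test sentence are placeholders.

```python
# Lemmatize, filter through WordNet, and match against per-category synonym
# sets. NLTK replaces CoreNLP/JWI here (an assumed substitution).
# Run nltk.download("wordnet") once beforehand.
import re
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
CATEGORIES = ["entertainment", "food", "music"]  # placeholder category names

def synonym_set(word):
    """Synonyms of a word plus synonyms of those synonyms, via WordNet."""
    first = {lemma.lower() for syn in wn.synsets(word) for lemma in syn.lemma_names()}
    second = {lemma.lower() for w in first
              for syn in wn.synsets(w) for lemma in syn.lemma_names()}
    return first | second | {word}

CATEGORY_SETS = {cat: synonym_set(cat) for cat in CATEGORIES}

def classify(story):
    # Lemmatize, then keep only words that WordNet knows about.
    words = [lemmatizer.lemmatize(w) for w in re.findall(r"[a-z]+", story.lower())]
    words = [w for w in words if wn.synsets(w)]
    # Count overlaps with each category's synonym set; pick the biggest.
    scores = {cat: sum(w in s for w in words) for cat, s in CATEGORY_SETS.items()}
    return max(scores, key=scores.get)

print(classify("The band played a lovely concert with two amazing songs."))
```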
If you're looking for something off the shelf, you might want to try Microsoft's data mining algorithms in SQL Server:
http://msdn.microsoft.com/en-us/library/ms175595%28v=SQL.100%29.aspx
http://www.sqlserverdatamining.com