I'm trying to classify an example, which contains discrete and continuous features. Also, the example represents sparse data, so even though the system may have been trained on 100 features, the example may only have 12.
What would be the best classifier algorithm to use to accomplish this? I've been looking at Bayes, Maxent, Decision Tree, and KNN, but I'm not sure any fit the bill exactly. The biggest sticking point I've found is that most implementations don't support sparse data sets and both discrete and continuous features. Can anyone recommend an algorithm and implementation (preferably in Python) that fits these criteria?
Libraries I've looked at so far include:
- Orange (Mostly academic. Implementations not terribly efficient or practical.)
- NLTK (Also academic. Has a good Maxent implementation, but doesn't handle continuous features.)
- Weka (Still researching this. Seems to support a broad range of algorithms, but has poor documentation, so it's unclear what each implementation supports.)
Weka (Java) satisfies all your requirements:
- a large number of classification/regression algorithms
- support for discrete/continuous (called nominal/numeric in Weka) attributes
- handles sparse data: ARFF format
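For reference, ARFF represents sparse instances by listing only index/value pairs inside braces; omitted attributes are treated as 0 (or the first nominal value). A toy file sketch:

```
@relation example
@attribute weight numeric
@attribute color {red,green,blue}
@attribute class {yes,no}
@data
{0 1.5, 2 yes}
{1 blue, 2 no}
```

Here the first instance specifies attributes 0 and 2 only; attribute 1 defaults to the first nominal value (red).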
Check out this Pentaho wiki for a list of links to documentation, guides, video tutorials, etc.
Support vector machines? libsvm can be used from Python, and is quite speedy.
It handles sparse vector inputs and won't mind if some of the features are continuous while others are just -1/+1. (If you've got an n-way discrete feature, the standard thing to do is expand it into n binary features.)
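The n-way expansion mentioned above is just one-hot encoding; a minimal sketch in plain Python (the `colors` categories are made up for illustration):

```python
def one_hot(value, categories):
    """Expand one n-way discrete value into n binary indicator features."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical 3-way discrete feature:
colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]
```

The resulting 0/1 columns can be concatenated with the continuous features before feeding them to the SVM.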
scikit-learn, a Python machine learning module, supports Stochastic Gradient Descent and Support Vector Machines on sparse data.
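A minimal sketch of that approach, assuming scikit-learn and SciPy are installed; the toy data (4 examples over a 100-feature space, mixing continuous values and 0/1 indicators) is invented for illustration:

```python
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

# Build a sparse matrix: only the features each example actually has are stored.
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 7, 3, 7, 0, 12, 3, 12]          # feature indices (out of 100)
vals = [0.5, 1.0, 0.2, 1.0, 0.9, 1.0, 0.1, 1.0]  # continuous + binary values
X = csr_matrix((vals, (rows, cols)), shape=(4, 100))
y = [0, 0, 1, 1]

# Linear SVM trained by stochastic gradient descent; accepts sparse input directly.
clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, y)
print(clf.predict(X).shape)  # (4,)
```

A new sparse example with only a dozen active features can be classified the same way, as long as it is encoded against the same 100-column feature space.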