Please 开发者_开发知识库suggest a good machine learning classifier for truecasing of dataset. Also, Is it possible to specify out own rules/features for truecasing in such a classifier? Thanks for all your suggestions.
Thanks
I implemented a version of a truecaser in Python. It can be trained for any language when you provide enough data (i.e. correctly cased sentences).
For English, it achieves an accuracy of 98.38% on sample sentences from Wikipedia. A pre-trained model for English is provided.
You can find it here: https://github.com/nreimers/truecaser
Please take a look at this whitepaper.
http://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
They report 98% of accuracy.
精彩评论