
NLP and Ruby to characterize quality of writing

Developer https://www.devze.com 2023-02-10 05:09 Source: the web

I'd like to take a shot at characterizing incoming documents in my app as either "well" or "poorly" written. I realize this is no easy task, but even a rough idea would be useful. I feel like the way to do this would be via a naïve Bayes classifier with two classes, but I am open to suggestions. So two questions:

  1. Is this method the best way to do this (taking simplicity into account), assuming a large enough training db?

  2. Are there libraries in Ruby (or anything that integrates via JRuby or similar) that I can plug into my Rails app to make this happen with little fuss?

Thanks!


You might try using vocabulary vector analysis. It is covered to some extent here:

http://en.wikipedia.org/wiki/Semantic_similarity

Basically you build up a corpus of texts that you deem "well-written" and a corpus that you deem "poorly-written", and count the frequency of certain words in each. Make a normalized vector for each corpus, and then compute the distance from each of those to the vector of each incoming document. I am not a statistician, but I'm told it's similar to Bayesian filtering, and it seems to deal with misspellings and outliers better.

This is not perfect, by any means. Depending on how accurate you need it to be, you will probably still need humans to make the final judgement. But we've had good luck using it as a pre-filter to reduce the number of reviewers needed.
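
The vocabulary-vector idea described above could be sketched roughly like this in plain Ruby. This is only a minimal illustration, not the code we actually use: the method names (`tokenize`, `normalized_vector`, `cosine_similarity`, `classify`) are made up for the example, the tokenizer is a crude regex, and cosine similarity is assumed as the distance measure.

```ruby
# Split text into lowercase word tokens (a crude stand-in for a real tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Build a word-frequency vector from one text or an array of texts,
# scaled to unit length so that long and short inputs are comparable.
def normalized_vector(texts)
  counts = Hash.new(0)
  Array(texts).each { |t| tokenize(t).each { |w| counts[w] += 1 } }
  norm = Math.sqrt(counts.values.sum { |c| c * c })
  return {} if norm.zero?
  counts.transform_values { |c| c / norm }
end

# For unit-length vectors, cosine similarity is just the dot product.
def cosine_similarity(a, b)
  a.sum { |word, weight| weight * b.fetch(word, 0.0) }
end

# Label a document by whichever corpus vector it is closer to.
# Ties go to :well, which is an arbitrary choice for this sketch.
def classify(document, well_vector, poor_vector)
  doc = normalized_vector(document)
  well = cosine_similarity(doc, well_vector)
  poor = cosine_similarity(doc, poor_vector)
  well >= poor ? :well : :poor
end
```

You would call `normalized_vector` once on each training corpus, keep the two resulting hashes around, and then run `classify` on each incoming document. In practice you would probably want to restrict the vocabulary to "certain words" as mentioned above (for example by dropping very common words), which this sketch does not do.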


Another simple algorithm to check out is the Flesch-Kincaid readability metric. It is quite widely used and should be easy to implement. I assume one of the Ruby NLP libraries has syllable methods.
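
In case no library fits, here is a rough sketch of the Flesch-Kincaid grade level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, in plain Ruby. The method names are invented for this example, and the syllable counter is a simple vowel-group heuristic that I am assuming is good enough for a rough score; a dictionary-based counter would be more accurate.

```ruby
# Estimate syllables by counting groups of consecutive vowels,
# dropping a silent trailing "e" (but keeping "-le" endings like "table").
# This is a heuristic and will miscount some words.
def count_syllables(word)
  w = word.downcase.gsub(/[^a-z]/, "")
  return 0 if w.empty?
  w = w.sub(/e\z/, "") if w.length > 2 && !w.end_with?("le")
  [w.scan(/[aeiouy]+/).size, 1].max
end

# Flesch-Kincaid grade level; returns nil when the text has no words
# or no sentences. Sentences are split naively on ., ! and ?.
def flesch_kincaid_grade(text)
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words = text.scan(/[A-Za-z']+/)
  return nil if sentences.empty? || words.empty?

  syllables = words.sum { |w| count_syllables(w) }
  0.39 * (words.size.to_f / sentences.size) +
    11.8 * (syllables.to_f / words.size) -
    15.59
end
```

A higher score means the text needs more years of schooling to read, so the score measures reading difficulty rather than writing quality directly. It might work better as one feature among several than as the whole classifier.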


You may find the work by Burstein, Chodorow, and Leacock on the Criterion essay evaluation system interesting. It gives a very high-level overview of how one particular system did essay evaluation as well as style correction.
