I am a final-year CS student and very interested in OCR and NLP.
The problem is that I don't know anything about OCR yet, and my project lasts only 5 months. What OCR and NLP work would be viable for my project?
Is writing a (simple) OCR engine for a single language too hard for my project? What about adding support for a new language to existing FOSS OCR software?
My background is in the commercial side of OCR, and in my experience writing anything but a simple OCR engine would take a fair amount of time. To get even reasonable results, your input files would have to contain very clean text characters, or you would need lots of marked-up training data to train the engine. This would limit your usable input to high-quality printed documents and computer-generated documents, such as a Word document exported to a TIFF image. Commercial OCR engines do a much better job of reading standard scanned invoices and letters than even Tesseract OCR, and they still make mistakes.
You could write a simple OCR engine and use NLP and language analysis to show how they can improve the OCR results. Most OCR engines do this anyway, but it could be an interesting project. The commercial engines have had years of fine-tuning to improve their recognition accuracy, and they use every trick they can think of.
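As a rough sketch of what "using language analysis to improve OCR results" could look like, the snippet below snaps each recognized word to the closest entry in a word list by edit distance. Everything in it is an assumption for illustration: the names `edit_distance`, `correct_word` and `correct_text`, the tiny `LEXICON`, and the `max_distance` threshold are all hypothetical, and real engines use far richer language models than this.

```python
# Hypothetical word list; a real project would load a full dictionary
# for the target language.
LEXICON = {"invoice", "number", "total", "amount"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def correct_word(word: str, lexicon=LEXICON, max_distance: int = 2) -> str:
    """Return the nearest lexicon word, or the input if nothing is close."""
    if word in lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word


def correct_text(text: str, lexicon=LEXICON) -> str:
    """Correct a whitespace-separated string of lowercase OCR tokens."""
    return " ".join(correct_word(w, lexicon) for w in text.split())


if __name__ == "__main__":
    print(correct_text("invo1ce arnount xyz"))
```

Here "invo1ce" and "arnount" are typical OCR confusions (1 for i, rn for m). Comparing raw OCR output against the corrected output on a test set would give you a measurable accuracy improvement to report in the project.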
This article may give you some ideas on one way to write an OCR engine:
http://www.codeproject.com/KB/dotnet/simple_ocr.aspx
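Separately from that article (this is not necessarily the approach it takes), about the simplest recognizer I can think of is template matching: compare each segmented character bitmap against stored reference bitmaps and pick the closest one. The sketch below assumes the characters are already segmented, binarized and scaled to a fixed 3x5 grid; the `TEMPLATES` table and the `hamming` and `classify` names are made up for illustration.

```python
# Hypothetical 3x5 reference bitmaps ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return (label, distance) for the template closest to the glyph."""
    label = min(TEMPLATES, key=lambda k: hamming(glyph, TEMPLATES[k]))
    return label, hamming(glyph, TEMPLATES[label])


if __name__ == "__main__":
    # A "T" with one stray ink pixel in the fourth row.
    noisy_t = ["111", "010", "010", "011", "010"]
    print(classify(noisy_t))
```

This only works when the input is very clean, which is exactly the limitation described above: a few pixels of noise, skew or a different font quickly push the glyph closer to the wrong template.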
You may be able to contribute to the Tesseract project, but you would first need to research what has already been included, what has not, and whether anyone else is working on the same problem.