I am using tesse开发者_运维知识库ract 3.0.0 and I bumped into the following problem:
When there is something too small for tesseract to recognize it seems it's merged with other fragments. As a result nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to tesseract. Over the rectangle is the result (V over T means new line).
The last case is the problem one. Is there someway to improve tesseract in situations like this?
As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commertial OCR applications.) Typically, before OCR is done, image is get's split on separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only on text ares and don't face problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used as line recognizer only, or so called field-level recognizer, when you use it on small snippets of text cut from bigger image.
I did not followed throughly what was introduced in 3.0, probably it is already there partially, but obviously it does not work as expected, as you have just found out.
There is another opensource project - OCRopus, that aproached this problem exactly as I described - first Document Analisys (aka Segmentation) and only then OCR. Their earlier versions were actually using Tesseract for OCR after analisys step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down in priorities list.
Here's what you actually can do to address your problem:
- If your images have very typical structure, you can try to do some dumb segmentation and cut text from the image yourself before passing it to Tesseract. However, if you expect to have wide variety of images to be supported, just forget it.
- You can ckeck OCRopus and see if their segmentation work for your images. If yes, then you can spend some time to make OCRopus + Tesseract work together.
- Well, if what you do is not just for fun and you value your time, I would recommend thinking about real OCR engine like ABBYY. You will get much higher accuracy of both segmentaiton and OCR out of the box, and professional customer support of course.
Disclaimer: I work for ABBYY
精彩评论