I'm looking for an open source OCR library that runs on Linux. It needs to work with PNGs and PDFs, and ideally I would like to call it from Java or Ruby. Is anything like this available?
Regards.
Tesseract is a very good OCR engine: https://github.com/tesseract-ocr/tesseract
The project was started at HP Labs and is now continued and sponsored by Google (for Google Books!). It is released under the Apache license and runs on Linux. It reads TIFF or PNG files; for PDFs, you will need to convert to one of these formats first. As far as I know there is no language binding, so you would invoke it as a subprogram.
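To illustrate the "invoke it as a subprogram" approach from Java, here is a minimal sketch. It assumes a `tesseract` executable is on the PATH and that it accepts the usual `tesseract <image> <outputbase>` form, writing the recognized text to `<outputbase>.txt`. The class and method names (`TesseractCli`, `command`, `recognize`) are made up for this example and are not part of Tesseract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps the tesseract command-line tool, not a real binding.
final class TesseractCli {

    // Builds the command line "tesseract <image> <outputBase>".
    // Assumption: tesseract appends ".txt" to the output base itself.
    static List<String> command(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs tesseract on one image and returns the recognized text.
    static String recognize(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase))
                .inheritIO() // forward tesseract's messages to our own stdout/stderr
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with status " + exitCode);
        }
        // Read the text file tesseract is assumed to have written.
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }
}
```

Calling `TesseractCli.recognize(Paths.get("scan.png"), Paths.get("scan"))` would then return the contents of `scan.txt`, provided the OCR run succeeds. The same pattern should carry over to Ruby with backticks or `system`.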
Cuneiform is free and does a decent job. You could invoke it as a subprogram but there's no language binding that I know of. It won't read PDFs directly but you can easily take apart PDFs that are sequences of scanned images to feed them to Cuneiform. There are also scripts to reassemble the images and text back into a searchable PDF.
Try tesjeract, which uses JNI to call the Tesseract OCR API.
For PDFs, you'll need to convert them to images first, using Ghostscript, for instance.
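The PDF-to-image step could look roughly like this from Java. This is only a sketch: it assumes Ghostscript is installed as `gs` on the PATH and uses its `png16m` output device; the class name `PdfToPng`, the `page-%03d.png` naming pattern, and the choice of resolution are illustrative assumptions, not anything prescribed by Ghostscript or Tesseract.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: shells out to Ghostscript to rasterize a PDF.
final class PdfToPng {

    // Builds a gs command that writes one PNG per page into outputDir.
    // "%03d" is expanded by Ghostscript to the page number (001, 002, ...).
    static List<String> command(Path pdf, Path outputDir, int dpi) {
        String pattern = outputDir.resolve("page-%03d.png").toString();
        return Arrays.asList(
                "gs",
                "-dBATCH",            // exit when the input has been processed
                "-dNOPAUSE",          // do not prompt between pages
                "-dSAFER",            // restrict file access from the PDF
                "-sDEVICE=png16m",    // 24-bit colour PNG output
                "-r" + dpi,           // rendering resolution in dots per inch
                "-sOutputFile=" + pattern,
                pdf.toString());
    }

    // Runs the conversion and fails if Ghostscript reports an error.
    static void convert(Path pdf, Path outputDir, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdf, outputDir, dpi))
                .inheritIO() // forward Ghostscript's messages to our own stdout/stderr
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("gs exited with status " + exitCode);
        }
    }
}
```

Each resulting `page-NNN.png` can then be passed to the OCR engine one at a time. A resolution around 300 dpi is a common starting point for scanned text, but that is a rule of thumb to tune rather than a requirement.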