Open source OCR [closed]

Source: https://www.devze.com, 2023-02-13 21:43 (reposted from the web)

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Related topics: ocr pdf ruby

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 4 years ago.

I'm looking for an open source OCR library that runs on Linux. I need it to work with PNGs and PDFs. Ideally I would like to call the library from Java or Ruby. Is anything like this available?

Regards.

Tesseract is a very good OCR engine: https://github.com/tesseract-ocr/tesseract

The project was started at HP Labs and is now maintained and sponsored by Google (for Google Books!). It is released under the Apache license, and it runs on Linux. It reads TIFF or PNG files; for PDFs, you will need to convert them to one of these formats first. I don't think there is a language binding, so you would have to invoke it as a subprogram...
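As a rough illustration of the "invoke it as a subprogram" approach, here is a minimal Ruby sketch. It assumes the `tesseract` executable is on the PATH and that the installed version accepts `stdout` as the output base (recent releases do; older ones may only write to a file). The helper names `tesseract_command` and `ocr_image` are made up for this example and are not part of any library.

```ruby
require "open3"

# Build the argument list for the tesseract command-line tool.
# Assumption: passing "stdout" as the output base makes tesseract print
# the recognized text instead of writing <base>.txt.
# "-l" selects the language data to use (for example "eng").
def tesseract_command(image_path, lang: "eng")
  ["tesseract", image_path, "stdout", "-l", lang]
end

# Run tesseract as a subprocess and return the recognized text.
# Raises if the process exits with a non-zero status.
def ocr_image(image_path, lang: "eng")
  out, err, status = Open3.capture3(*tesseract_command(image_path, lang: lang))
  raise "tesseract failed: #{err}" unless status.success?
  out
end
```

Calling `ocr_image("scan.png")` would then return the recognized text as a string. Passing the arguments as an array instead of one shell string avoids quoting problems with unusual file names.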

Cuneiform is free and does a decent job. You could invoke it as a subprogram but there's no language binding that I know of. It won't read PDFs directly but you can easily take apart PDFs that are sequences of scanned images to feed them to Cuneiform. There are also scripts to reassemble the images and text back into a searchable PDF.

Try tesjeract, which uses JNI to call the Tesseract OCR API.

For PDFs, you'll need to convert them to images first, using Ghostscript, for instance.
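The PDF-to-image step could look something like the Ruby sketch below. It assumes Ghostscript is installed as `gs`; the 300 dpi default, the `page-%03d.png` naming pattern, and the helper names `ghostscript_command` and `pdf_to_pngs` are choices made for this example only.

```ruby
require "open3"

# Build a Ghostscript command that renders every page of a PDF to a PNG.
# -dNOPAUSE/-dBATCH run non-interactively, -sDEVICE=png16m selects 24-bit
# colour PNG output, -r sets the resolution in dpi, and the %03d in the
# output pattern is replaced by the page number.
def ghostscript_command(pdf_path, output_pattern, dpi: 300)
  [
    "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
    "-sDEVICE=png16m",
    "-r#{dpi}",
    "-sOutputFile=#{output_pattern}",
    pdf_path
  ]
end

# Render the PDF into output_dir and return the PNG paths in page order.
def pdf_to_pngs(pdf_path, output_dir, dpi: 300)
  pattern = File.join(output_dir, "page-%03d.png")
  log, status = Open3.capture2e(*ghostscript_command(pdf_path, pattern, dpi: dpi))
  raise "gs failed: #{log}" unless status.success?
  Dir.glob(File.join(output_dir, "page-*.png")).sort
end
```

Each path returned by `pdf_to_pngs` can then be handed to the OCR engine one page at a time. Scanned text usually needs a fairly high resolution to be recognized well, which is why the sketch defaults to 300 dpi.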