
How to extract embedded OCR data from a PDF?

Developer https://www.devze.com 2023-02-14 21:11 Source: Internet

I have PDF files with embedded OCR data (I have already OCRed them), so they are searchable. Now I want to extract this OCR data, because I want to put it in my tomcat6 search server. For this I need the plain OCR data. So my question is: is it possible to extract this embedded OCR data from the PDF files? It would be nice to get files with coordinates, but plain-text files would also be sufficient.


You should be able to do this with iText or iTextSharp. iTextSharp has essentially no documentation, however, and a good number of its functions are not equivalent to those found in iText.

PDFSharp does not support iref streams. iText, iTextSharp, and PDFSharp are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.
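
For a sense of what any of these libraries has to pull out: the OCR layer of a searchable PDF is normally ordinary text drawn in the page content streams (often in an invisible render mode), positioned with operators such as `Tm` and `Td` and shown with `Tj`. The following is only a rough, standard-library Python sketch of that idea, not how iText or iTextSharp work internally. It assumes streams are either uncompressed or use a single Flate filter, that text is written as simple literal strings with `Tj`, and that the text matrix has no scaling or rotation; hex strings, `TJ` arrays, custom font encodings, and encrypted files are not handled. The function names `iter_content_streams` and `extract_text_runs` are made up for this illustration.

```python
import re
import zlib

# Body of every stream object in the file ("stream<EOL> ... endstream").
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Number as it appears in a content stream, e.g. 72, -14, 0.5
NUM = rb"[-+]?\d*\.?\d+"

# The handful of text operators this sketch understands.
TOKEN_RE = re.compile(
    rb"(?P<bt>\bBT\b)"                                        # begin text object
    rb"|(?<![\w.+-])(?P<tm>(?:" + NUM + rb"\s+){6})Tm\b"      # a b c d e f Tm
    rb"|(?<![\w.+-])(?P<td>(?:" + NUM + rb"\s+){2})Td\b"      # tx ty Td
    rb"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b"                # (string) Tj
)

# Undo the escapes \( \) and \\ inside literal strings (others are left as-is).
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def iter_content_streams(pdf_bytes):
    """Yield each stream body, inflated if it looks like Flate data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # assume the stream was stored uncompressed


def extract_text_runs(pdf_bytes):
    """Return a list of (x, y, text) tuples, one per Tj operator found."""
    runs = []
    for content in iter_content_streams(pdf_bytes):
        x = y = 0.0
        for match in TOKEN_RE.finditer(content):
            if match.group("bt") is not None:
                x = y = 0.0  # BT resets the text position
            elif match.group("tm") is not None:
                nums = [float(n) for n in match.group("tm").decode("ascii").split()]
                x, y = nums[4], nums[5]  # translation part of the matrix
            elif match.group("td") is not None:
                dx, dy = (float(n) for n in match.group("td").decode("ascii").split())
                x, y = x + dx, y + dy  # simplification: ignores matrix scaling
            else:
                text = UNESCAPE_RE.sub(rb"\1", match.group("text"))
                runs.append((x, y, text.decode("latin-1")))
    return runs
```

Reading a file's bytes and passing them to `extract_text_runs` gives the text runs with the coordinates where each run starts; joining the third element of each tuple gives plain text for the search index. On real scanned documents expect this to miss a lot, which is the reason to reach for a proper library.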
