Tika returning incorrect lines of text for a PDF with lots of tables

https://www.devze.com 2023-03-28 18:28 Source: web
I am using Tika to extract text from a PDF file that has a lot of tables.

java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf

It returns some invalid text, and sometimes it drops the white space between two words; for example, it returns "qu inakli fmyathematical ideas to the real world" instead of "Link mathematical ideas to the real world".

Is there a way to minimize this kind of error, or is there another library that I can use? Does it make sense to use OCR to process this kind of PDF?


Try controlling the text order when using the PDFBox parser: PDFTextStripper has a flag (sortByPosition) that controls whether lines are output in the order they appear on the page. By default (in PDFBox) it is set to false for performance reasons, so no order is preserved, but Tika has changed its own behavior between releases, switching this flag on and off.
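To see why this flag matters for table-heavy PDFs, here is a minimal sketch in plain Java (no PDFBox involved). It assumes a page is just a list of text fragments with coordinates, stored in an arbitrary order, which is roughly what a table looks like inside a PDF. The `Fragment` class, the coordinates, and both method names are made up for illustration, and the sort shown here is a simplification, not PDFBox's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical text fragment: a string plus the position of its top-left corner. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order they are stored, as when sorting is switched off. */
    static String storedOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /**
     * Sorts a copy of the fragments top-to-bottom, then left-to-right, and joins them.
     * Simplification: fragments count as being on the same line only if y matches exactly.
     */
    static String positionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        return storedOrder(sorted);
    }

    public static void main(String[] args) {
        // Fragments deliberately stored out of reading order.
        List<Fragment> cells = new ArrayList<>();
        cells.add(new Fragment("ideas", 120, 10));
        cells.add(new Fragment("Link", 10, 10));
        cells.add(new Fragment("to the real world", 10, 25));
        cells.add(new Fragment("mathematical", 50, 10));

        System.out.println(storedOrder(cells));   // ideas Link to the real world mathematical
        System.out.println(positionOrder(cells)); // Link mathematical ideas to the real world
    }
}
```

The extra sort is also where the performance cost mentioned above comes from: every fragment on the page has to be collected and ordered before any text can be emitted.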

More details on exactly this problem are in my blog post, "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)".


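To get the text from the PDF in the right order, I had to set the SortByPosition flag to true (using tika-app-1.19.jar):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    // 'is' is an InputStream already opened on the PDF file
    BodyContentHandler handler   = new BodyContentHandler();
    Metadata           metadata  = new Metadata();
    ParseContext       context   = new ParseContext();
    PDFParser          pdfParser = new PDFParser();

    // Fetch the parser's configuration, enable position sorting, and set it back
    PDFParserConfig config = pdfParser.getPDFParserConfig();
    config.setSortByPosition(true); // needed for text in correct order
    pdfParser.setPDFParserConfig(config);

    pdfParser.parse(is, handler, metadata, context);
    String text = handler.toString(); // the extracted text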