PDF extraction issue with Apache PDFBox 1.3.1

Developer https://www.devze.com 2023-02-16 19:33 Source: web
I am facing an issue while extracting data from PDFs using Apache PDFBox. With PDFBox version 1.1, I was able to extract the data properly, but the same code gives different output with version 1.3.1. The issue occurs only for a few PDFs.

Code sample

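import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package location in PDFBox 1.x
PDDocument document = PDDocument.load(new File("sample.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true); // order the text by its position on the page
System.out.println(stripper.getText(document));
document.close(); // release the underlying file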

Here is the sample output:

With version 1.1: Account Number xxxxx xxxxxx-xx-x .....

With version 1.3.1: SCHDoe SISInrPnnvuttccraareillreuucfczeX dde,Pt reeF Hr rusdeDiIBc N dsDVeOe I:PiiTgdtlaYieutais Bll sXPuwF rn ew df ew l er .rdceo dS mwecritvhaiscte.cso 0 m 2 / 1 2 - 0431/01-1649-9105040.99 MURTgs Ac Bw TAoiucllttciaonol g PuA Danmyta otNeuunmt Dbueer 00$0T P9122a5/0/g3117e198. /4/211 17 11o6f0 3498-01-6 THITTTPTNoFHHoDC ttEE HDaaDE lliiAAP ggVXAM-hiTRtTFda A Tueo .....

Does anybody have any idea what the problem could be?


I recommend that you try PDFBox 1.5.0; a lot of text extraction issues have been fixed in that release.
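Since only a few PDFs are affected, it may help to check the extracted text of each file against a phrase that is known to appear in it (such as "Account Number" in the sample above) when trying a different release. The following is only a rough sketch under that assumption: the class ExtractionCheck, its methods, and the file names are hypothetical and not part of PDFBox, and the input strings stand in for whatever stripper.getText(document) returned for each file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of PDFBox): flags files whose extracted
// text does not contain a phrase that is expected to be present.
public class ExtractionCheck {

    // Collapse runs of whitespace so that differences in line breaks or
    // spacing between releases do not count as a mismatch.
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the extracted text contains the expected phrase.
    public static boolean containsExpected(String extracted, String expectedPhrase) {
        return normalize(extracted).contains(normalize(expectedPhrase));
    }

    // Returns the names of the files whose extracted text lacks the phrase.
    public static List<String> findSuspect(Map<String, String> extractedByFile,
                                           String expectedPhrase) {
        List<String> suspect = new ArrayList<>();
        for (Map.Entry<String, String> entry : extractedByFile.entrySet()) {
            if (!containsExpected(entry.getValue(), expectedPhrase)) {
                suspect.add(entry.getKey());
            }
        }
        return suspect;
    }

    public static void main(String[] args) {
        // Illustrative inputs; in practice these would be the strings
        // returned by the text stripper for each PDF.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("good.pdf", "Account  Number xxxxx xxxxxx-xx-x");
        extracted.put("bad.pdf", "SCHDoe SISInrPnnvuttccraareillreuucfczeX dde,Pt reeF");
        System.out.println(findSuspect(extracted, "Account Number"));
    }
}
```

Running the same check against the output of each PDFBox release would show which files change between versions. It does not explain why the text is scrambled; it only narrows down which PDFs to look at.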
