Reading a PDF file to get tabular data in a structured format

Developer https://www.devze.com 2022-12-12 06:13 Source: web
I have to read a PDF file that contains a table with several columns. Using iTextSharp I am able to read the file, but I get a bunch of unformatted text. I am not able to structure the data so that I can insert it into a database.

Any suggestions?


Unless it's structured text, there is no tagging to show columns. Tools like PdfBox make 'guesses' to try to extract the table.

There is an article explaining why text extraction is so hard at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text


If I understand it correctly, PDF text is stored positionally, so it has no concept of rows or columns. That means you have to use heuristics based on the "likelihood" that you're reading from a different column.

You can try doing this by comparing the amount of space between the words. (I'm not familiar with the iTextSharp interface, so please forgive me if I'm mentioning things it's not capable of. I'm mostly familiar with pdfNet.)
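
As a rough illustration of the spacing idea, here is a minimal Python sketch. It assumes you have already pulled each word out of the PDF together with its coordinates (how you get those depends on the library). The Word type, the function names, and the tolerance values are all made up for this example and would need tuning for a real document.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical word record: text plus its position on the page."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline (PDF y grows upward, so larger y is higher on the page)


def group_rows(words: List[Word], y_tol: float = 2.0) -> List[List[Word]]:
    """Group words whose baselines are within y_tol of each other into rows."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: -w.y):  # top of page first
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, read left to right.
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def split_cells(row: List[Word], gap: float = 10.0) -> List[str]:
    """Start a new cell whenever the horizontal gap between words exceeds gap."""
    cells: List[str] = []
    current = [row[0]]
    for prev, w in zip(row, row[1:]):
        if w.x0 - prev.x1 > gap:
            cells.append(" ".join(t.text for t in current))
            current = [w]
        else:
            current.append(w)
    cells.append(" ".join(t.text for t in current))
    return cells


def extract_table(words: List[Word], y_tol: float = 2.0, gap: float = 10.0) -> List[List[str]]:
    """Turn a flat list of positioned words into a list of rows of cell strings."""
    return [split_cells(r, gap) for r in group_rows(words, y_tol)]
```

Note that this breaks down when a cell is empty, because the remaining cells in that row shift left. Matching cell positions against the header row's x positions is one possible way around that.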

Another idea that just came to me: the text may have visual cues, such as vertical lines separating the columns. If that's the case, you should be able to come up with heuristics to determine whether the text is left or right of the column lines.
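
If you can get the x positions of those vertical lines, the column lookup could look something like the sketch below. Again, this is only an assumption-laden example: the (text, x0, x1) tuples, the separator list, and the function name are hypothetical, and finding the line positions in the first place is left to whatever PDF library you use.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical word record for one row: (text, left edge, right edge).
RowWord = Tuple[str, float, float]


def row_to_cells(row: Sequence[RowWord], separators: Sequence[float]) -> List[str]:
    """Assign each word to a column based on which separators lie to its left.

    N separator lines define N + 1 columns. A word goes into the column
    that contains its horizontal midpoint.
    """
    seps = sorted(separators)
    cells = [""] * (len(seps) + 1)
    for text, x0, x1 in sorted(row, key=lambda w: w[1]):  # left to right
        col = bisect_right(seps, (x0 + x1) / 2)
        cells[col] = (cells[col] + " " + text).strip()
    return cells
```

Unlike the gap-based approach, this keeps empty cells in place, since a column with no words simply stays an empty string.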

...

However, the best thing to do, if possible, is to get hold of the data in a more database-friendly format. This will likely save heartache in the long run.

-- Jason


I am concluding there is no straightforward way to do this, at least not for reading the data in tabular format. I tried the suggestions provided by Mark, but they do not seem feasible for my requirements.
