PDF parsing specific text

Developer https://www.devze.com 2023-03-11 06:34 Source: Internet
Hi, I'm working on an app that parses PDF data for viewing on mobile devices. I'm looking for a way to scan a PDF file for specific text and get the x and y coordinates of that text block. Is that even possible? I'm working on a Linux server with PHP, but I'm flexible about using whatever means gets this working. Thanks.


Commercial options:

  • TET (Text Extraction Toolkit) SDK from http://www.pdflib.com; Acrobat plug-in available for testing the mechanism
  • pdfToolbox SDK from http://www.callassoftware.com; interactive desktop version available for testing
  • if you are ready to do some more of the coding yourself: Adobe PDF Library, SDK, available through Datalogics

All three are fairly mature. TET is dedicated to text extraction. pdfToolbox is a general-purpose SDK for analyzing and manipulating PDFs, with a specific feature for extracting text together with its coordinates on the page. Adobe PDF Library is a general-purpose development tool: it offers a lot of low-level features, but you would have to write the code that finds text, words, or characters and pulls out their coordinates.
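To give a rough idea of the code you would write on top of a low-level library, here is a minimal Python sketch of the search step only. It assumes the library has already given you one positioned character at a time, in reading order. The `Glyph` type, `find_text`, and `layout_line` are hypothetical names made up for illustration and are not part of TET, pdfToolbox, or Adobe PDF Library. Real PDFs do not guarantee reading order, so a real implementation would need to sort and group characters first.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Glyph:
    """One character and its bounding box in page coordinates.

    Hypothetical type: assumes `char` is exactly one character, so that
    an index into the joined text is also an index into the glyph list.
    """
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


BBox = Tuple[float, float, float, float]


def find_text(glyphs: List[Glyph], needle: str) -> List[BBox]:
    """Return one bounding box (x0, y0, x1, y1) per occurrence of `needle`.

    Assumes `glyphs` is already in reading order.
    """
    if not needle:
        return []
    text = "".join(g.char for g in glyphs)
    boxes: List[BBox] = []
    start = text.find(needle)
    while start != -1:
        hit = glyphs[start:start + len(needle)]
        # The box of the match is the union of the boxes of its characters.
        boxes.append((
            min(g.x0 for g in hit),
            min(g.y0 for g in hit),
            max(g.x1 for g in hit),
            max(g.y1 for g in hit),
        ))
        start = text.find(needle, start + 1)
    return boxes


def layout_line(text: str, x: float = 72.0, y: float = 700.0,
                width: float = 6.0, height: float = 10.0) -> List[Glyph]:
    """Build fake fixed-width glyphs for one line (stand-in for real PDF data)."""
    return [
        Glyph(c, x + i * width, y, x + (i + 1) * width, y + height)
        for i, c in enumerate(text)
    ]


glyphs = layout_line("Total due: 42 EUR")
print(find_text(glyphs, "42"))  # [(138.0, 700.0, 150.0, 710.0)]
```

The tools listed above do the hard part, which is turning the PDF content stream into positioned characters or words. Once you have those, matching and box-merging like this is comparatively simple.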

Disclaimer: I work for callas software, my view on pdfToolbox may be biased.
