PDF parsing specific text

Developer https://www.devze.com 2023-03-11 06:34 Source: web

Related topics: parsing pdf php

Hi, I'm working on an app that parses PDF data for viewing on mobile devices. I'm looking for a way to scan a PDF file for specific text and get the x and y coordinates of that text block. Is that even possible? I'm working on a Linux server with PHP, but I'm flexible and will use whatever it takes to get this working. Thanks.

Commercial options:

TET (Text Extraction Toolkit) SDK from http://www.pdflib.com; Acrobat plug-in available for testing the mechanism
pdfToolbox SDK from http://www.callassoftware.com; interactive desktop version available for testing
Adobe PDF Library SDK, available through Datalogics, if you are ready to do more of the coding yourself

All three are fairly mature. TET is specific to text extraction. pdfToolbox is a general-purpose SDK for analyzing and manipulating PDFs, with a dedicated text extraction feature that reports the coordinates of text on the page. Adobe PDF Library is a general-purpose development tool: it offers many low-level features, but you would have to write the code that finds the text, words, or characters and pulls out their coordinates.
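Whichever toolkit you pick, the last step is usually the same: you get a list of words with their bounding boxes and have to search it for your text. A minimal sketch of that step in Python, assuming the extractor returns words in reading order. The Word class, the find_phrase function and the sample data are hypothetical and not part of any of the SDKs above. Whether y grows upward or downward also depends on the tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """One word and its bounding box, as a hypothetical extractor might report it."""
    text: str
    x0: float  # left edge
    y0: float  # lower y value of the box
    x1: float  # right edge
    y1: float  # higher y value of the box


def find_phrase(words: List[Word], phrase: str) -> Optional[Tuple[float, float, float, float]]:
    """Return the union bounding box of the first case-insensitive match, or None."""
    targets = phrase.lower().split()
    if not targets:
        return None
    n = len(targets)
    # Slide a window of n consecutive words over the page and compare texts.
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w.text.lower() for w in window] == targets:
            return (
                min(w.x0 for w in window),
                min(w.y0 for w in window),
                max(w.x1 for w in window),
                max(w.y1 for w in window),
            )
    return None


# Made-up sample data for one page, in reading order.
page_words = [
    Word("Invoice", 72.0, 700.0, 120.0, 712.0),
    Word("Total", 72.0, 680.0, 100.0, 692.0),
    Word("Due:", 104.0, 680.0, 130.0, 692.0),
    Word("42.00", 134.0, 680.0, 165.0, 692.0),
]

print(find_phrase(page_words, "total due:"))
```

This only matches whole words that appear consecutively in the list. A phrase that wraps across lines would still match, but its union box would then span both lines, so you may want to return one box per line instead.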

Disclaimer: I work for callas software, so my view on pdfToolbox may be biased.