Extract text from Word or PDF based on format (font name and size)

开发者 https://www.devze.com 2022-12-12 06:15 Source: the web

I need to parse a large text (a Word or PDF document of about 1000 pages) and place some of the text from this document into database fields.

The only thing I have found that distinguishes the text I want to extract is its format: it is always "Helvetica-Condensed", size 12.

Can I do that? I know how to use the string functions, but what should I use to test the format?

As I said, the text is stored inside a Word document or a PDF.

If there is a third-party component that can do this, no problem; please point me to it.

Thanks

There is QuickPDF. The price is $249.00.

The other option is to code it yourself. The file specification is available online, and if you're only trying to rip the text out of the document, it should guide you most of the way.
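
As a middle ground between buying a component and parsing the file format by hand, here is a rough sketch in Python. It assumes the third-party PyMuPDF library (imported as `fitz`), which is not mentioned above, and relies on its `page.get_text("dict")` output, where each text span carries a `font` name and a `size`. The function names, the substring match on the font name (to tolerate subset prefixes such as `ABCDEF+`), and the size tolerance are my own choices for illustration, not something the question specifies.

```python
def spans_matching(page_dict, font_name, size, tol=0.1):
    """Yield the text of every span set in the given font and size.

    page_dict is assumed to have the layout PyMuPDF returns from
    page.get_text("dict"): blocks -> lines -> spans.
    """
    wanted = font_name.lower()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                same_font = wanted in span["font"].lower()
                same_size = abs(span["size"] - size) <= tol
                if same_font and same_size:
                    yield span["text"]


def extract_from_pdf(path, font_name="Helvetica-Condensed", size=12.0):
    """Collect matching text from every page of a PDF (needs PyMuPDF)."""
    import fitz  # third-party: pip install pymupdf

    results = []
    with fitz.open(path) as doc:
        for page in doc:
            results.extend(spans_matching(page.get_text("dict"), font_name, size))
    return results
```

Calling `extract_from_pdf("sample.pdf")` (a hypothetical file name) would return a list of strings that you can then clean up with ordinary string functions before writing them to the database. Font names inside a PDF do not always match what a viewer shows, so it is worth printing a few spans first to see what the file really calls the font.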

The only thing to be careful of is documents that are built entirely from images. In that scenario, no matter what you use to read the file, you will also need an OCR application. To check whether this is the case, open a sample of the type of file you want to extract text from, select some text, copy it, and try to paste it into Notepad. If nothing can be selected or pasted, the pages are images.
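
With 1000 pages, the copy-and-paste check can also be automated. The sketch below again assumes the third-party PyMuPDF library, and the helper names are made up for illustration. It simply lists the pages from which no text at all can be extracted, which are the likely candidates for OCR.

```python
def pages_without_text(page_texts):
    """Return the 0-based indexes of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


def scan_pdf(path):
    """List the pages of a PDF that have no extractable text (needs PyMuPDF)."""
    import fitz  # third-party: pip install pymupdf

    with fitz.open(path) as doc:
        # page.get_text() with no arguments returns the page's plain text.
        return pages_without_text(page.get_text() for page in doc)
```

An empty list from `scan_pdf` suggests every page has a text layer. A page that appears in the list is probably a scanned image, although it could also just be a blank page.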