
Python to read PDF files [duplicate]

Developer (https://www.devze.com), 2023-03-04 22:52. Source: the web
This question already has answers here: How to extract text from a PDF file? (33 answers) Closed 1 year ago.

I have found many posts proposing solutions for reading PDFs. I want to read a PDF file word by word and do some processing on it. People suggest pdfMiner, which converts the entire PDF file into a text file. But what I want is to read PDFs word by word. Can anyone suggest a library that does this?


Possibly the fastest way to do this is to first convert your PDF into a text file using pdftotext (pdfMiner's site states that pdfMiner is 20 times slower than pdftotext) and afterwards parse the text file as usual.
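As a rough sketch of that two-step approach: the snippet below assumes the `pdftotext` command-line tool is installed and on your PATH, and the helper names `pdf_to_text` and `iter_words` are made up for illustration. It treats any run of non-whitespace characters as a "word", which may or may not match what your processing needs.

```python
import re
import subprocess
from typing import Iterator

# A "word" here is simply a run of non-whitespace characters (an assumption).
WORD_RE = re.compile(r"\S+")


def pdf_to_text(pdf_path: str) -> str:
    """Run the pdftotext CLI and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,           # raise CalledProcessError if pdftotext fails
        capture_output=True,  # collect stdout/stderr instead of printing them
    )
    return result.stdout.decode("utf-8", errors="replace")


def iter_words(text: str) -> Iterator[str]:
    """Yield the words of the text one at a time, in reading order."""
    for match in WORD_RE.finditer(text):
        yield match.group()


# Example usage (needs a real PDF file and pdftotext installed):
#   for word in iter_words(pdf_to_text("example.pdf")):
#       print(word)
```

Because `iter_words` is a generator, you can process one word at a time without building a list of every word in the document first.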

Also, when you said "I want to read a pdf file word by word and do some processing on it", you didn't specify whether you want to do processing based on the words in a PDF file, or whether you actually want to modify the PDF file itself. If it's the latter, then you've got an entirely different problem on your hands.


I'm using pdfminer and it is an excellent lib, especially if you're comfortable programming in Python. It reads a PDF and extracts every character, and it provides each character's bounding box as a tuple (x0,y0,x1,y1). Pdfminer will extract rectangles, lines and some images, and will try to detect words. It has an unpleasant O(N^3) routine that analyses bounding boxes to coalesce them, so it can get very slow on some files. Try converting a typical file of yours: maybe it'll be fast, or maybe it'll take an hour, depending on the file.

You can easily dump a PDF out as text; that's the first thing you should try for your application. You can also dump XML (see below), but you can't modify the PDF. XML is the most complete representation of the PDF you can get out of it.

You have to read through the examples to use it in your Python code; it doesn't have much documentation.

The example that comes with PdfMiner that transforms PDF into XML best shows how to use the lib in your code. It also shows you what's extracted in human-readable (as far as XML goes) form.

You can call it with parameters that tell it to "analyze" the PDF. If you do, it'll coalesce letters into blocks of text (words and sentences; sentences will have spaces, so it's easy to tokenize them into words in Python).
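For illustration, here is a minimal sketch of the "analyze, then tokenize" idea. It assumes the pdfminer.six fork (which runs on Python 3) rather than the original pdfminer described in this answer, and the function names `iter_text_lines` and `words_from_lines` are hypothetical. Passing `LAParams()` turns on layout analysis, so characters are grouped into text lines before you see them.

```python
from typing import Iterable, Iterator, List, Tuple

# Bounding box as (x0, y0, x1, y1), the tuple format mentioned above.
BBox = Tuple[float, float, float, float]


def iter_text_lines(pdf_path: str) -> Iterator[Tuple[str, BBox]]:
    """Yield (text, bbox) for every text line found by layout analysis."""
    # Imported lazily so words_from_lines() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path, laparams=LAParams()):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip rectangles, lines, images, etc.
            for line in element:
                if isinstance(line, LTTextLine):
                    yield line.get_text(), line.bbox


def words_from_lines(lines: Iterable[Tuple[str, BBox]]) -> List[str]:
    """Split each analyzed line on whitespace and collect the words."""
    words: List[str] = []
    for text, _bbox in lines:
        words.extend(text.split())
    return words


# Example usage (needs pdfminer.six and a real PDF file):
#   words = words_from_lines(iter_text_lines("example.pdf"))
```

The bounding box is carried along but unused in `words_from_lines`; it is there in case your processing needs to know where on the page a line sits.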


While I really liked the pdfminer answer, packages change over time. At the time of writing, pdfminer still does not support Python 3 and may need to be updated. So, to update the subject, even though an answer has already been voted up, I'd propose going with pdfrw. From the website:

  • Version 0.3 is tested and works on Python 2.6, 2.7, 3.3, 3.4, and 3.5
  • Operations include subsetting, merging, rotating, modifying metadata, etc.
  • The fastest pure Python PDF parser available
  • Has been used for years by a printer in pre-press production
  • Can be used with rst2pdf to faithfully reproduce vector images
  • Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones
  • Permissively licensed