
I have written a program to extract text from a PDF in Python, and now need to make it run for every PDF in a folder and save the output as a text file

Source: https://www.devze.com 2022-12-15 20:29 (origin: web)

So far here is the code I have (it is working and extracting text as it should).

import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")

I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.

Matt


Take a look at os.walk()
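A minimal sketch of that suggestion: os.walk() visits a directory tree recursively, so you can collect every PDF under a root folder regardless of nesting. The function name find_pdfs here is just illustrative, not from the question.

```python
import os

def find_pdfs(root):
    """Recursively collect the paths of all .pdf files under root."""
    pdf_paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so "report.PDF" is also found
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(dirpath, name))
    return pdf_paths
```

Each path returned can then be fed to the question's getPDFContent().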


for loop to get it to run on all PDFs in a directory: look at the glob module

save the text as a CSV: look at the csv module

count the pictures: look at the pyPDF module :-)
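The glob and csv suggestions can be combined into a short driver loop, sketched below. Here extract_text is only a stand-in for the question's pyPdf-based getPDFContent(), and the column layout of the CSV is an assumption.

```python
import csv
import glob
import os

def extract_text(path):
    # Stand-in for the question's getPDFContent(); swap in the real
    # pyPdf-based extractor here.
    with open(path, "rb") as f:
        return f.read().decode("utf-8", "ignore")

def pdfs_to_csv(pdf_dir, csv_path):
    """Extract text from every .pdf in pdf_dir, one CSV row per file."""
    with open(csv_path, "w") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "text"])
        # glob matches the wildcard pattern in a single directory
        for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
            writer.writerow([os.path.basename(pdf_path),
                             extract_text(pdf_path)])
```

Counting the pictures would slot into the same loop, using pyPDF to inspect each page's resources.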

Two comments on this statement:

content = " ".join(content.replace(u"\xa0", " ").strip().split())

(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()

(2) Using strip() is redundant:

>>> u"  foo  bar  ".split()
[u'foo', u'bar']
>>>
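Point (1) can be checked the same way: split() with no argument breaks on any Unicode whitespace, NBSP included, so the replace() call is unnecessary too.

```python
s = u"foo\xa0bar baz"
# split() with no argument splits on any Unicode whitespace,
# including the no-break space U+00A0
parts = s.split()
# parts is now ['foo', 'bar', 'baz']
```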


The glob module can help you find all files in a single directory that match a wildcard pattern.

