So far, here is the code I have (it is working and extracting text as it should):
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/nick/TAM_work/TAM_pdfs/2006-1.pdf").encode("ascii", "ignore")
I now need to add a for loop to get it to run on all PDFs in /TAM_pdfs, save the text as a CSV, and (if possible) add something to count the pictures. Any help would be greatly appreciated. Thanks for looking.
Matt
Take a look at os.walk()
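A quick sketch of how os.walk() could be used here, assuming the directory from the question; unlike glob, it recurses into subdirectories:

import os

# Walk the tree rooted at the question's directory and print every PDF path
for dirpath, dirnames, filenames in os.walk("/home/nick/TAM_work/TAM_pdfs"):
    for name in filenames:
        if name.lower().endswith(".pdf"):
            print os.path.join(dirpath, name)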
for loop to get it to run on all PDFs in a directory: look at the glob module
save the text as a CSV: look at the csv module
count the pictures: look at the pyPdf module :-)
A rough sketch combining all three is below.
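This is a minimal, untested sketch that reuses getPDFContent() from the question. The output filename output.csv and the countImages() helper are mine, as is the assumption that pictures are stored as /Image XObjects in each page's /Resources (inline images would not be counted):

import csv
import glob
import pyPdf

def countImages(path):
    # Count the /Image XObjects on each page (assumption: this is how
    # the pictures are stored; inline images are missed)
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    count = 0
    for i in range(0, pdf.getNumPages()):
        resources = pdf.getPage(i)["/Resources"]
        if "/XObject" in resources:
            xobjects = resources["/XObject"].getObject()
            for name in xobjects:
                if xobjects[name].getObject()["/Subtype"] == "/Image":
                    count += 1
    return count

# One CSV row per PDF: path, extracted text, image count
writer = csv.writer(open("output.csv", "wb"))
writer.writerow(["filename", "text", "image_count"])
for path in glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf"):
    text = getPDFContent(path).encode("ascii", "ignore")
    writer.writerow([path, text, countImages(path)])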
Two comments on this statement:
content = " ".join(content.replace(u"\xa0", " ").strip().split())
(1) It is not necessary to replace the NBSP (U+00A0) with a SPACE, because NBSP is (naturally) considered to be whitespace by unicode.split()
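For example, in a Python 2 interpreter (u"\xa0" is the NBSP):

>>> u"foo\xa0bar".split()
[u'foo', u'bar']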
(2) Using strip() is redundant:
>>> u" foo bar ".split()
[u'foo', u'bar']
>>>
The glob module can help you find all files in a single directory that match a wildcard pattern.
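For example, using the directory from the question (non-recursive; subdirectories are not searched):

import glob

# All PDF files directly inside the directory
pdf_paths = glob.glob("/home/nick/TAM_work/TAM_pdfs/*.pdf")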