I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.
Is there a PHP library, preferably open-source, that can read PDF metadata? If so, or if there isn't, how would one use the library (or lack thereof) to extract the metadata?
To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.
PDF Parser does exactly what you want and it's pretty straightforward to use:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
$details = $pdf->getDetails(); // array of metadata entries such as Title and Author
You can try it on the demo page.
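To narrow that array down to the four fields from the question, something like the sketch below might help. It assumes PDF Parser was installed with Composer (package smalot/pdfparser) and that getDetails() uses keys such as 'Title' and 'Author', with some values possibly arriving as arrays. The helper name pickMetadata and the file name document.pdf are made up for illustration.

```php
<?php
// Hypothetical helper: keep only the fields asked about in the question.
function pickMetadata(array $details): array
{
    $wanted = ['Title', 'Author', 'Subject', 'Keywords'];
    $out = [];
    foreach ($wanted as $key) {
        $value = $details[$key] ?? null;   // null when the PDF does not set the field
        if (is_array($value)) {
            $value = implode(', ', $value); // flatten multi-valued entries
        }
        $out[$key] = $value;
    }
    return $out;
}

// Assumption: Composer's autoloader sits next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Only runs when the library and the sample file are actually present.
if (class_exists('\Smalot\PdfParser\Parser') && is_file('document.pdf')) {
    $parser = new \Smalot\PdfParser\Parser();
    $details = $parser->parseFile('document.pdf')->getDetails();
    print_r(pickMetadata($details));
}
```

Fields the PDF does not define come back as null, so the caller can tell "missing" apart from an empty string.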
The Zend framework includes Zend_Pdf, which makes this really easy:
$pdf = Zend_Pdf::load($pdfPath);
echo $pdf->properties['Title'] . "\n";
echo $pdf->properties['Author'] . "\n";
Limitations: works only on unencrypted files smaller than 16 MB.
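<?php
// Read the raw PDF bytes. This only finds metadata that is stored uncompressed in the file.
$sourcefile = "file path";
$stringedPDF = file_get_contents($sourcefile, true);
// Match the Title entry written as (...) or [...]; the match includes the delimiters.
if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
    echo $title[0];
}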
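The same raw-bytes idea can be extended to the other fields. The following is only a rough sketch, not a real PDF parser: it assumes the metadata is stored as uncompressed literal strings like /Author (Jane Doe), and it will miss hex strings, UTF-16 text, nested unescaped parentheses, encrypted files, and anything inside compressed object streams. The function name extractInfoField is made up for illustration.

```php
<?php
// Hypothetical helper: find "/Key (value)" in raw PDF bytes and return the value.
function extractInfoField(string $raw, string $key): ?string
{
    // Value body: escaped characters (\x) or anything that is not a backslash or parenthesis.
    $pattern = '#/' . preg_quote($key, '#') . '\s*\(((?:\\\\.|[^\\\\()])*)\)#s';
    if (!preg_match($pattern, $raw, $m)) {
        return null; // field absent, or stored in a form this sketch cannot read
    }
    // Undo the escapes for "(", ")" and "\" inside the literal string.
    return preg_replace('/\\\\([()\\\\])/', '$1', $m[1]);
}

$sourcefile = 'document.pdf'; // placeholder path
if (is_file($sourcefile)) {
    $raw = file_get_contents($sourcefile);
    foreach (['Title', 'Author', 'Subject', 'Keywords'] as $key) {
        echo $key . ': ' . (extractInfoField($raw, $key) ?? '(not found)') . "\n";
    }
}
```

Unlike the one-liner above, this returns the value without the surrounding parentheses and returns null instead of raising a notice when nothing matches.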
I was looking for the same thing today. And I came across a small PHP class over at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. Output is UTF-8 encoded.
The creator says:
Here’s a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.
// basic example
include 'PDFInfo.php';
$p = new PDFInfo;
$p->load('file.pdf');
echo $p->author;
echo $p->title;
echo $p->pages;
For me, it works! All thanks go solely to the creator of the class ... well, maybe a little bit of thanks to me too for finding the class ;)
You may use PDFtk to extract the page count:
// Windows
$bin = realpath('C:\\pdftk\\bin\\pdftk.exe');
$cmd = "cmd /c {$bin} {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";
// Unix
$cmd = "pdftk {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";
If ImageMagick is available you may also use:
$cmd = "identify -format %n {$path}";
Execute in PHP via shell_exec():
$res = shell_exec($cmd);
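Since dump_data also prints the document info as InfoKey / InfoValue line pairs, the same call can give you title and author, not just the page count. Below is a minimal sketch that parses that output in PHP instead of piping through grep and sed. It assumes pdftk is on the PATH; the function name parsePdftkDump and the file name document.pdf are made up for illustration. It also passes the path through escapeshellarg(), which is worth doing whenever the path comes from user input.

```php
<?php
// Hypothetical helper: turn pdftk dump_data output into an array.
function parsePdftkDump(string $dump): array
{
    $info = [];
    $pages = null;
    $pendingKey = null;
    foreach (preg_split('/\R/', $dump) as $line) {
        if (preg_match('/^InfoKey:\s*(.*)$/', $line, $m)) {
            $pendingKey = $m[1];                 // remember the key until its value arrives
        } elseif ($pendingKey !== null && preg_match('/^InfoValue:\s*(.*)$/', $line, $m)) {
            $info[$pendingKey] = $m[1];
            $pendingKey = null;
        } elseif (preg_match('/^NumberOfPages:\s*(\d+)/', $line, $m)) {
            $pages = (int) $m[1];
        }
    }
    return ['info' => $info, 'pages' => $pages];
}

$path = 'document.pdf'; // placeholder path
if (is_file($path)) {
    // escapeshellarg() quotes the path so spaces or shell metacharacters cannot break the command.
    $dump = shell_exec('pdftk ' . escapeshellarg($path) . ' dump_data');
    if (is_string($dump)) {
        print_r(parsePdftkDump($dump));
    }
}
```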