I want to extract all the keywords from a huge PDF file (~50 MB). Which module is good for parsing large PDF files? I'm concerned about memory usage when parsing a huge file and extracting almost all the keywords. I want SAX-style parsing (one-pass) rather than DOM-style (by analogy to XML).
To read text out of a PDF, we use CAM::PDF, and it worked just fine. It wasn't hugely fast on some larger files, but its ability to handle large files was respectable. We certainly had a few that were ~100 MB, and they were handled OK. If I recall, we struggled with a few that were 130 MB on a 32-bit (Windows) Perl, but we had a whole lot of other stuff in memory at the time. We did look at PDF::API2, but it seemed more oriented to generating PDFs than reading from them. We didn't throw large files at PDF::API2, so I can't give a real benchmark figure.
The only significant downside we found with CAM::PDF is that PDF 1.6 is becoming more common, and it doesn't work at all in CAM::PDF yet. That might not be an issue for you, but it is something to consider.
In answer to your question, I'm pretty sure both modules read the whole source PDF into memory in one form or another, but I don't think CAM::PDF builds as many complex structures from it. So neither is really SAX-like, but CAM::PDF seemed lighter in general, and it can retrieve one page at a time, which might reduce the load when extracting text from very large files.
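To illustrate the page-at-a-time approach described above, here is a minimal sketch using CAM::PDF's `numPages()` and `getPageText()` methods. The filename `large.pdf` is a placeholder, and the word-frequency count is a stand-in for whatever keyword extraction you actually need:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# CAM::PDF parses the document structure up front, but page
# content streams are decoded only when a page is requested.
my $pdf = CAM::PDF->new('large.pdf')
    or die "Cannot open PDF: $CAM::PDF::errstr\n";

my %count;
for my $page (1 .. $pdf->numPages()) {
    # Extract just this page's text rather than the whole file.
    my $text = $pdf->getPageText($page);
    next unless defined $text;
    $count{lc $1}++ while $text =~ /(\w+)/g;
}

# Crude "keywords": the twenty most frequent words.
my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 19];
print "$_\t$count{$_}\n" for grep { defined } @top;
```

This still isn't SAX-like, since the document structure is held in memory, but processing one page's text at a time avoids also holding the entire extracted text at once.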