Haskell: parsing PDF

Developer https://www.devze.com 2022-12-22 05:25 Source: Internet

What I need is to read a PDF, make some transformations (generate TOC bookmarks) and write it back.

I found HPDF (http://hackage.haskell.org/package/HPDF), but its documentation only mentions generating PDFs, not parsing them (although I could have missed it).

I chose Haskell purely for (self-)educational purposes.


There are a few tools for PDF manipulation, though they seem biased towards generation rather than parsing:

  • http://johnmacfarlane.net/pandoc/

Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).

There's also:

  • http://hackage.haskell.org/package/HsHaruPDF
  • http://hackage.haskell.org/package/pdf2line -- tool for extracting text from pdf
  • http://hackage.haskell.org/package/HPDF -- another pdf generation library

I'm not sure we have a good parsing tool yet.


Also as a learning exercise, I started a PDF parsing library in Haskell, but it's incomplete and has been languishing a bit from lack of attention. I'd be happy to share it with you, and would love feedback, improvements, etc. It's not currently hosted on Hackage, but if you're interested in working with an incomplete implementation, let me know and I'll ask some colleagues for advice on getting it up there.


Here's a Haskell binding to parts of xpdf: http://hackage.haskell.org/package/pdf2line


Check out the pdf-toolbox library. Its support for generating PDF files is low level, but powerful enough for your task.

Here is an example of how to change the title of an existing PDF file using the incremental update feature.
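
To give a feel for what an incremental update relies on, independently of pdf-toolbox's own API: the update appends new objects and a new cross-reference section to the end of the file, and the new trailer points back (via /Prev) at the previous cross-reference section, whose byte offset is the number following the last startxref keyword. The sketch below only recovers that offset. It assumes the bytestring package, and lastStartXref is a hypothetical helper name, not a function from any of the libraries mentioned here.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isDigit, isSpace)

-- | Return the byte offset written after the last "startxref" keyword
-- that is followed by a number, or Nothing if there is none.
-- (Hypothetical helper for illustration; not part of pdf-toolbox.)
lastStartXref :: B.ByteString -> Maybe Int
lastStartXref = go Nothing
  where
    keyword = B.pack "startxref"

    go found bytes =
      case B.breakSubstring keyword bytes of
        (_, rest)
          | B.null rest -> found            -- no further occurrence
          | otherwise ->
              let afterKeyword = B.drop (B.length keyword) rest
                  digits = B.takeWhile isDigit (B.dropWhile isSpace afterKeyword)
                  found' = case B.readInt digits of
                             Just (n, _) -> Just n
                             Nothing     -> found  -- keep the earlier match
              in go found' afterKeyword

-- | Read a file and report the offset of its most recent xref section.
lastStartXrefInFile :: FilePath -> IO (Maybe Int)
lastStartXrefInFile path = lastStartXref <$> B.readFile path
```

In GHCi, lastStartXrefInFile "input.pdf" (the file name is a placeholder) gives the value a new trailer would carry in its /Prev entry. A real implementation would scan only the tail of the file and would also have to handle cross-reference streams.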


Another package to consider is rakhana, which is also on Hackage.
