How do extract text layer and background layer from pdf?_问答_开发者

In my project I've to do a PDF Viewer in HTML5/CSS3 and the application has to allow user to add comments and annotation. Actually, I've to do something very similar to crocodoc.com.

At the beginning I was thinking to create images from the PDF and allow user create area and post comments associates to this area. Unfortunately, the client wants also navigate in this PDF and add only comments on allowed sections (for开发者_StackOverflow社区 example, paragraphs or selected text).

And now I'm in front of one problem that is to get the text and the best way to do it. If any body has some clues how I can reach it, I would appreciate.

I tried pdftohtml, but output doesn't look like the original document whom is really complex (example of document). Even this one doesn't reflect really the output, but is much better than pdftohtml.

I'm open to any solutions, with preference for command line under linux.

I've been down the same road as you, with even much more complex tasks.

After trying out everything I ended up using C# under Mono (so it runs on linux) with iTextSharp.

Even with a very complete library such as iTextSharp, some tasks required allot of trial-and-error :)

To extract the text from a page is easy (check the below snipper), however if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.

int pdf_page = 5;
string page_text = "";

PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while(token.NextToken())
{
    if(token.TokenType == PRTokeniser.TokType.STRING)
    {
        page_text += token.StringValue;
    }
    else if(token.StringValue == "Tj")
    {
        page_text += " ";
    }
}

Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.

Addition:

Given the task you are required to do, I have a suggestion for you:

Extract the text with coordinates and font families and sizes - all information about each paragraph. Then, to a PDF-to-images, and in your online viewer, apply invisible selectable text over the paragraphs on the image where needed.

This way your users can select a part of the text where needed, without the need of reconstructing the whole PDF in html :)

I recently researched and discovered a native PHP solution to achieve this using FOSS. The FPDI PHP class can be used to import a PDF document for use with either the TCPDF or FPDF PHP classes, both of which provide functionality for creating, reading, updating and writing PDF documents. Personally, I prefer TCPDF as it provides a larger feature set (TCPDF vs. FPDF), a richer API (TCPDF vs. FPDF), more usage examples (TCPDF vs. FPDF) and a more active community forum (TCPDF vs. FPDF).

Choose one of the before mentioned classes, or another, to programmatically handle PDF documents. Focusing on both current and possible future deliverables, as well as the desired user experience, decide where (e.g. server - PHP, client - JavaScript, both) and to what extent (feature driven) your interactive logic should be implemented.

Personally, I would use a TCPDF instance obtained by importing a PDF document via FPDI to iteratively inspect, translate to a common format (XML, JSON, etc.) and store the resulting representation in relational tables designed to persist data pertinent to the desired level of document hierarchy and detail. The necessary level of detail is often dictated by a specifications document and its mention of both current and possible future deliverables.

Note: In this case, I strongly advise translating documents and storing them in a common format to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same application functionality for users uploading Microsoft Word documents. If the uploaded Microsoft Word document was not translated and stored in a common format then updates to the Web service API and dependent business logic would almost certainly be necessary. This ultimately results in storing bloated, sub-optimal data and inefficient use of development resources in designing, developing and supporting multiple translators. It would also be an inefficient use of server resources to translate outbound data for every request, as opposed to translating inbound data to an optimal format only once.

I would then extend the base document tables by designing and relating additional tables for persisting functionality specific document asset data such as:

Versioned Additions / Edits / Deletions

What
- Header / Footer
- Text
  - Original Value
  - New Value
- Image
  - Page(s) (one, many or all)
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - File (relative or absolute directory or url)
- Brush (drawing)
  - Page(s) (one, many or all)
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - Shape (x/y coordinates to redraw line, square, circle, user defined, etc.)
  - Type (pen, pencil, marker, etc.)
  - Weight (1px, 3px, 5px, etc.)
  - Color
- Annotation
  - Page
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - Shape (line, square, circle, user defined, etc.)
  - Value (annotation text)
- Comment
  - Target (page, another text/image/brush/annotation asset, parent comment - threading)
  - Value (comment text)
When
- Date
- Time
Who
- User

Once some, all or more, of the document and its asset data has a place to persist I would design, document and develop a PHP Web service API to expose CRUD and PDF document upload functionality to the UI consumer, while enforcing core business rules. At this point, the remaining work now lies on the Client-side. Currently, I have relational tables persisting both a document and its asset data, as well as an API exposing sufficient functionality to the consumer, in this case the Client-side JavaScript.

I can now design and develop a Client-side application using the latest Web technologies such as HTML5, JavaScript and CSS3. I can upload and request PDF documents using the Web service API and easily render the returned common format out to the browser however I decide (probably HTML in this case). I can then use 100% native JavaScript and/or 3rd party libraries for DOM helper functionality, creating vector graphics to provide drawing and annotation features, as well as access and control functional and stylistic attributes of currently selected document text and/or images. I can provide a real-time collaborative experience by employing WebSockets (before mentioned WebService API does not apply), or a semi-delayed, but still fairly seamless experience using XMLHttpRequest.

From this point forward the sky is the limit and the ball is in your court!

It's a hard task you're trying to accomplish.

To read text from a PDF, have a look at PEAR's PDF_Reader proposal code.

There's also a very extensive documentation around Zend_PDF(), which also allows the loading and parsing of a PDF document. The various elements of the PDF can be iterated on and thus also being transformed to HTML5 or whatever you like. You may even embed the notations from your website into the PDFs and vice versa.

Still, you have been given no easy task. Good Luck.

pdftk is a very good tool to do thinks like that (I don't know if it can do exactly this task).

http://www.pdflabs.com/docs/pdftk-cli-examples/