Can't read some PDF files with iTextSharp

I have a Win32 application that uses iTextSharp to read PDFs and insert an image into each document as a seal.

It has worked fine with 99% of the files we have processed for over a year, but lately some files just can't be read. When I execute the code below:

string inputfile = @"C:\test.pdf"; // verbatim string, so "\t" is not read as a tab
PdfReader reader = new PdfReader(inputfile);

It gives the exception:

System.NullReferenceException occurred
  Message="Object reference not set to an instance of an object."
  Source="itextsharp"
  StackTrace:
       em iTextSharp.text.pdf.PdfReader.ReadPages()
       em iTextSharp.text.pdf.PdfReader.ReadPdf()
       em iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
       em iTextSharp.text.pdf.PdfReader..ctor(String filename)
       em MyApp.insertSeal() na C:\MyApp\Stamper.cs:linha 659

The PDF files that throw this exception open normally in Adobe Reader, and when I open one of them in Acrobat and save it, my application can read the saved file.

Could the files be corrupted and yet still open in Adobe Reader?

I am sharing with you two samples of files.

A file that does NOT work: Not-Ok-Version.pdf

And a file that works, after I opened and saved it with Acrobat: OK-Version.pdf

Here's the (java, sorry) source for readPages:

protected internal void ReadPages() {
  catalog = trailer.GetAsDict(PdfName.ROOT);
  rootPages = catalog.GetAsDict(PdfName.PAGES);
  pageRefs = new PageRefs(this);
}

trailer, catalog, rootPages, and pageRefs are all member variables of PdfReader.

If the trailer or root/catalog object of a PDF is simply missing, your PDF is REALLY BADLY BROKEN. It's more likely that the xref table is a bit off, and the objects in question simply aren't exactly where they're supposed to be (which is Bad, but recoverable).

HOWEVER, when PdfReader first opens a PDF, it parses ALL the objects in the file, and converts them to the appropriate PdfObject-derived classes.

What it isn't doing is checking to see that the object number claimed by the xref table and the object number read in from the file Actually Match. Highly Unlikely, but possible. Bad software could write out their PDF objects in the wrong order but keep the byte offsets in the xref table correct. Software that overrode the object number from the xref table with the number from that particular byte offset in the file would be fine.

iText is not fine.

I still want to see the PDF.

Yep. That PDF is broken alright. Specifically:

The file's first 70kb or so define a pretty clean little PDF. Changes were then appended to the PDF.

Check that. Someone attempted to append changes to the PDF and failed. Badly. To understand just how badly, let me explain some of the internal syntax of a PDF, illustrated with this example:

%PDF-1.6
1 0 obj
<</Type/SomeObject ...>>
endobj
2 0 obj
<</Type/SomeOtherObj /Ref 1 0 R>>
endobj
3 0 obj
...
endobj
<etc>
xref
0 10
0000000000 65535 f
0000000010 00000 n
0000000049 00000 n
0000000098 00000 n
...
trailer
<</Root 4 0 R /Size 10>>
startxref 124
%%EOF

So we have a header/version "%PDF-1.v", a list of objects (the ones here are called dictionaries), a cross (x) reference table listing the byte offset and generation number of each object (the object number is implied by the entry's position in the table), and a trailer giving the root object and the number of objects in the PDF, followed by the byte offset to the 'x' in 'xref'.
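Since the cross reference table is just a list of byte offsets, you can check it against the file yourself. What follows is a rough C# sketch, not anything that ships with iTextSharp: XrefCheck and FindMismatches are names made up for this example, and it assumes a classic plain-text xref table with a single subsection (no cross-reference streams). It reports every in-use entry whose offset does not land on the matching "N G obj" header.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class XrefCheck
{
    // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // "xref", then the subsection header: first object number and entry count.
    static readonly Regex Header = new Regex(@"\Gxref\s+([0-9]{1,9})\s+([0-9]{1,9})\s+");
    // One entry: 10-digit offset, 5-digit generation, n (in use) or f (free).
    static readonly Regex Entry = new Regex(@"\G([0-9]{10}) ([0-9]{5}) ([nf])\s*");
    // The "N G obj" header that should sit at an in-use entry's offset.
    static readonly Regex ObjHeader = new Regex(@"\G([0-9]{1,9})\s+([0-9]{1,9})\s+obj");

    public static List<string> FindMismatches(byte[] pdf, int xrefOffset)
    {
        string text = Latin1.GetString(pdf);
        var problems = new List<string>();
        if (xrefOffset < 0 || xrefOffset > text.Length)
        {
            problems.Add("xref offset " + xrefOffset + " is outside the file");
            return problems;
        }
        Match header = Header.Match(text, xrefOffset);
        if (!header.Success)
        {
            problems.Add("no xref table at offset " + xrefOffset);
            return problems;
        }
        int first = int.Parse(header.Groups[1].Value);
        int count = int.Parse(header.Groups[2].Value);
        int pos = header.Index + header.Length;
        for (int i = 0; i < count; i++)
        {
            int number = first + i; // object numbers are implied by position
            Match entry = Entry.Match(text, pos);
            if (!entry.Success)
            {
                problems.Add("malformed xref entry for object " + number);
                break;
            }
            pos = entry.Index + entry.Length;
            if (entry.Groups[3].Value == "f") continue; // free entry: nothing to check
            long offset = long.Parse(entry.Groups[1].Value);
            int generation = int.Parse(entry.Groups[2].Value);
            Match obj = offset < text.Length ? ObjHeader.Match(text, (int)offset) : Match.Empty;
            if (!obj.Success)
                problems.Add("object " + number + ": no object header at offset " + offset);
            else if (int.Parse(obj.Groups[1].Value) != number ||
                     int.Parse(obj.Groups[2].Value) != generation)
                problems.Add("object " + number + ": offset " + offset + " holds " + obj.Value);
        }
        return problems;
    }
}
```

An empty result means every in-use entry in that subsection points at the object it claims to; anything else is a table that a strict reader may choke on.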

You can append changes to an existing PDF. To do so you just add any new or changed objects after the existing %%EOF, a cross reference table to those new objects, and a trailer. The trailer of an appended change should include a /Prev key with the byte offset to the previous cross reference table.
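To see how many times a file has been appended to, and whether each appended section is wired up properly, you can scan for the "startxref ... %%EOF" tails. The C# below is a sketch under a few assumptions: Revision, PdfRevisions and Scan are hypothetical names, it only understands classic xref tables (a file using cross-reference streams will report PointsAtXref as false even when healthy), and it looks for /Prev with a plain text search instead of parsing the trailer dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

sealed class Revision
{
    public long XrefOffset;    // the number written after "startxref"
    public bool PointsAtXref;  // true when that offset lands on the "xref" keyword
    public bool HasPrev;       // true when this revision's trailer mentions /Prev
}

static class PdfRevisions
{
    // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");
    static readonly Regex StartXref = new Regex(@"startxref\s+([0-9]{1,10})\s+%%EOF");

    public static List<Revision> Scan(byte[] pdf)
    {
        string text = Latin1.GetString(pdf);
        var revisions = new List<Revision>();
        int sectionStart = 0; // where the current revision's bytes begin
        foreach (Match m in StartXref.Matches(text))
        {
            long offset = long.Parse(m.Groups[1].Value);
            // The trailer dictionary sits between "trailer" and "startxref".
            int trailer = text.LastIndexOf("trailer", m.Index, StringComparison.Ordinal);
            bool hasPrev = trailer >= sectionStart &&
                text.IndexOf("/Prev", trailer, m.Index - trailer, StringComparison.Ordinal) >= 0;
            revisions.Add(new Revision
            {
                XrefOffset = offset,
                PointsAtXref = offset + 4 <= text.Length &&
                    string.CompareOrdinal(text, (int)offset, "xref", 0, 4) == 0,
                HasPrev = hasPrev
            });
            sectionStart = m.Index + m.Length;
        }
        return revisions;
    }
}
```

In a well-formed file with appended changes, every revision after the first should have HasPrev set, and each XrefOffset should point at an "xref" keyword.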

In your NOT-OKAY pdf, someone tried to append changes to a PDF, AND FAILED HORRIBLY.

The original PDF is still there, intact. That's what Reader shows you, and what you get when you save the PDF. I hacked off everything after the first %%EOF in a hex editor, and the file was fine.

So here's the layout of your NOT-OKAY pdf:

%PDF1.4.1
1 0 obj...
2 through 7
xref
0 7
<healthy xref>
trailer <</Size 8 /Root 6 0 R /Info 7 0 R>>
startxref 68308
%%EOF

So far so good. Here's where things get ugly

<binary garbage>
endstream
endobj
xref 
0 7
<horribly wrong xref>
trailer <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
startxref 223022
%%EOF

The only thing RIGHT about that section is the startxref value.

Problems:

The second trailer has no /Prev key.
ALL the byte offsets in the second xref table are wrong.
The <binary garbage> is part of a "stream" object, but the beginning of that object IS MISSING. Streams should look something like this:

1 0 obj
<</Type/SomeType/Length 123>>
stream
123 bytes of data
endstream
endobj

The end of this file is made up of some portion of a (compressed, I'd imagine) stream... but without the dictionary at the beginning telling us what filters it's using and how long it is (to say nothing of any missing data), you can't do anything with it.

I suspect that someone tried to completely rebuild this PDF, then accidentally wrote the original 70kb over the beginning of their version. Kaboom.

It would appear that Adobe is simply ignoring the bad appended changes. iText could do this too, but so can you:

When iText fails to open a PDF:
1. Search backwards through the file looking for the second to last %%EOF. Ignore the one at the very end, we want the previous state of the file.
2. Delete everything after the 2nd-to-last %%EOF (if any), and try to open it again.
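Those two steps could look roughly like this in C#. It is only a sketch: PdfRollback and DropLastRevision are names invented here, it works on the whole file held in memory, and it makes no attempt to decide whether the earlier revision is the one you actually want.

```csharp
using System;
using System.Text;

static class PdfRollback
{
    static readonly byte[] EofMarker = Encoding.ASCII.GetBytes("%%EOF");

    // Index of the last occurrence of marker starting at or before 'from', or -1.
    static int LastIndexOf(byte[] data, byte[] marker, int from)
    {
        for (int i = Math.Min(from, data.Length - marker.Length); i >= 0; i--)
        {
            int j = 0;
            while (j < marker.Length && data[i + j] == marker[j]) j++;
            if (j == marker.Length) return i;
        }
        return -1;
    }

    // Returns the bytes up to and including the second-to-last %%EOF,
    // or null when the file has fewer than two %%EOF markers.
    public static byte[] DropLastRevision(byte[] pdf)
    {
        int last = LastIndexOf(pdf, EofMarker, pdf.Length - EofMarker.Length);
        if (last < 0) return null;
        int previous = LastIndexOf(pdf, EofMarker, last - 1);
        if (previous < 0) return null;
        int length = previous + EofMarker.Length;
        byte[] result = new byte[length];
        Array.Copy(pdf, result, length);
        return result;
    }
}
```

One way to use it: read the file with File.ReadAllBytes, try new PdfReader(bytes), and if that throws, call DropLastRevision and retry with the returned array when it is not null. A null result means there is no earlier state to fall back to.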

The sad thing is that this broken PDF could have been completely different from the "original" 70kb, and then some IO error overwrote the first part of the file. Unlikely, but there's no way to be sure.

Considering that they are now up to version 5.0, my guess would be that you are seeing increasing numbers of PDFs written to PDF version specs that your version of iTextSharp does not support. It may be time to do an upgrade.

Maybe this will help someone... I had code that worked for years that started hanging on reading the bookmarks from a PDF file (outlines variable below). It turned out that it broke when the code was updated from .NET 4.0 to .NET 4.5.
As soon as I rolled it back to .NET 4.0, it worked again.

        RandomAccessFileOrArray raf = null;
        PdfReader reader1 = null;
        System.Collections.ArrayList outlines = null;
        raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sFile);
        reader1 = new iTextSharp.text.pdf.PdfReader(raf, null);
        outlines = iTextSharp.text.pdf.SimpleBookmark.GetBookmark(reader1);

For the record, the same VS web application project uses AjaxControlToolkit (from NuGet). Before I rolled it back, I also updated iTextSharp to version 5.5.5, and it still hung on the same line.

When I pull down the source and run it against the bad PDF there's an exception in ReadPdf() in the 4th try block when it calls ReadDocObj():

"Invalid object number. at file pointer 16"

tokens.StringValue is j

@Mark Storer, you're the iText guy so maybe that means something to you.

From a higher level, at least to my eyes, it seems that when RebuildXref() is called (which I assume is when an invalid PDF is read) it rebuilds trailer but not catalog. The latter is what the NRE is complaining about. Then again, that's just a guess.

Also make sure your HTML doesn't contain an hr tag when converting HTML to PDF:

hdnEditorText.Value.Replace("\"", "'").Replace("<hr />", "").Replace("<hr/>", "")
