I need to preflight an existing PDF file against the following parameters:
- Whether the fonts used in the text layers match the embedded fonts;
- Physical size (width and height in mm) of the document;
- Color profile of each image layer and of the whole document;
- Bleed/trim/art box of the document.
I need to do this with the .NET Framework. Any suggestions?
Take a look at iText, or implement your own solution based on poppler, which gives you really low-level access to PDF documents.
I'm biased (committer), but I suggest you use iText.
Your use of the word "layer" leads me to believe you don't mean (or understand) what "layer" usually means in PDF.
In PDF, layers are also called "optional content groups": parts of a given page that can be toggled on and off using various bits of logic (the current zoom level, for instance).
Text and images in PDF can have arbitrary depth/Z order. Text can be on top of an image, which can overlap some other text, which can be drawn over some other image, which... you get the idea. It doesn't happen that way very often (if ever), but it's possible.
But my understanding of what you're trying to ask is that you want the coordinates & graphic state of every piece of text and every image on a given page.
iText can do that, thanks to the fairly new parser package. In particular, PdfReaderContentParser with a custom RenderListener.
In your implementations of renderText and renderImage you'd store/examine everything you needed.
That gets you most of the way to 1 and 3. Digging up the color/embedding info will require some low-level schlepping about with PdfDictionary et al., and some knowledge of the PDF Specification.
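To give a rough idea of what the embedding check boils down to once you have the font dictionaries in hand: per the PDF spec, an embedded font program hangs off the font descriptor under FontFile, FontFile2 or FontFile3, and subset fonts get a tag of six uppercase letters plus '+' in front of the font name. The sketch below uses a plain Map as a stand-in for the descriptor dictionary. FontEmbeddingCheck, isEmbedded and isSubset are made-up names for illustration, not iText API, and the sketch ignores Type 3 fonts and the descendant-font indirection of composite (Type0) fonts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText. A plain Map stands in for the
// font descriptor dictionary you would walk via PdfDictionary.
final class FontEmbeddingCheck {
    // Subset fonts are named like "ABCDEF+Helvetica": six uppercase letters, then '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    // True if the descriptor carries an embedded font program under any of the
    // three keys the PDF spec defines for that purpose.
    static boolean isEmbedded(Map<String, ?> fontDescriptor) {
        if (fontDescriptor == null) {
            return false; // no descriptor at all, e.g. an unembedded standard font
        }
        return fontDescriptor.containsKey("FontFile")
                || fontDescriptor.containsKey("FontFile2")
                || fontDescriptor.containsKey("FontFile3");
    }

    // True if the font name carries a subset tag.
    static boolean isSubset(String baseFontName) {
        return baseFontName != null && SUBSET_TAG.matcher(baseFontName).matches();
    }

    public static void main(String[] args) {
        Map<String, Object> descriptor = new HashMap<>();
        descriptor.put("FontName", "ABCDEF+Helvetica");
        descriptor.put("FontFile2", new byte[0]); // placeholder for the font stream
        System.out.println("embedded: " + isEmbedded(descriptor));
        System.out.println("subset:   " + isSubset("ABCDEF+Helvetica"));
    }
}
```

With iText you'd pull the same keys out of the real font descriptor instead of a Map, and compare the result against the fonts your RenderListener actually saw in use.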
Numbers 2 and 4 are kinda funky based on how you phrased them, but the actuality is pretty straightforward.
PDF pages can have 5 different boxes:
- Media Box: The full size of the physical medium the page is meant to be printed on. Required.
- Crop Box: The region of the page that is visible when it is displayed or printed. Optional, defaults to the media box if not explicitly defined.
- Trim Box: The intended size of the finished page after trimming. Optional, defaults to the crop box.
- Art Box: A bounding box around the page's meaningful content, as intended by its creator. Optional, defaults to the crop box.
- Bleed Box: The region the page contents are clipped to in a production environment, i.e. the trim area plus any bleed. Optional, defaults to the (you guessed it!) crop box.
All these "defaults to the X" are implicit. If you ask for the trim box you might get "null", in which case it's Your Responsibility to check the crop box. If you get a null again, then you need to check the media box.
So when you ask for the physical dimensions, you might mean the media box, or the crop box... or maybe even the trim box (though I doubt it because you explicitly mention it later).
And when you want to know one of those other boxes, you need to know what it falls back to when that value isn't present.
Okay, so that's the theory. Nuts and bolts time (in Java):
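// iText 5 package names
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;

// Returns { media, crop, trim, art, bleed } for the given page (1-based),
// with missing boxes replaced by their defaults.
Rectangle[] getBoxen(PdfReader reader, int pageIndex) {
    Rectangle[] retRects = new Rectangle[5];
    // getBoxSize returns null when the box isn't explicitly defined
    retRects[0] = reader.getBoxSize(pageIndex, "media");
    retRects[1] = reader.getBoxSize(pageIndex, "crop");
    retRects[2] = reader.getBoxSize(pageIndex, "trim");
    retRects[3] = reader.getBoxSize(pageIndex, "art");
    retRects[4] = reader.getBoxSize(pageIndex, "bleed");

    // handle defaults
    // crop box defaults to media box
    if (retRects[1] == null) {
        retRects[1] = retRects[0];
    }
    // everything else (trim, art AND bleed) defaults to the crop box
    for (int i = 2; i < retRects.length; ++i) {
        if (retRects[i] == null) {
            retRects[i] = retRects[1];
        }
    }
    return retRects;
}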
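
Since you asked for millimetres: box coordinates come back in PDF user units, which are 1/72 inch ("points") by default, so the conversion is just a multiplication. A minimal sketch, assuming the page doesn't set a non-default UserUnit; PdfUnits and pointsToMm are made-up names, not iText API.

```java
import java.util.Locale;

// Hypothetical helper for turning box dimensions into millimetres.
// Assumes the default user unit of 1/72 inch; a page with a /UserUnit
// entry would need its values scaled by that factor first.
final class PdfUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double pointsToMm(double points) {
        return points * MM_PER_INCH / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // 595.28 x 841.89 pt is a typical A4 media box
        double widthMm = pointsToMm(595.28);
        double heightMm = pointsToMm(841.89);
        System.out.println(String.format(Locale.ROOT, "%.1f x %.1f mm", widthMm, heightMm));
        // prints "210.0 x 297.0 mm"
    }
}
```

Feed it the getWidth() and getHeight() of whichever Rectangle you decide counts as "the physical size".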