Summary: How can I reduce the amount of time it takes to convert TIFFs to PDFs using iTextSharp?
Background: I'm converting some fairly large TIFFs to PDF using C# and iTextSharp, and I am getting extremely poor performance. The TIFF files are approximately 50 KB apiece, and some documents have up to 150 separate TIFF files (each representing a page). One 132-page document (~6500 KB) took about 13 minutes to convert. During the conversion, the single-CPU server it ran on was at 100%, leading me to believe the process is CPU bound. The output PDF was 3.5 MB. I'm OK with the size, but the time taken seems a bit high to me.
Code:
private void CombineAndConvertTif(IList<FileInfo> inputFiles, FileInfo outputFile)
{
    using (FileStream fs = new FileStream(outputFile.FullName, FileMode.Create, FileAccess.ReadWrite, FileShare.None))
    {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        PdfWriter writer = PdfWriter.GetInstance(document, fs);
        document.Open();
        PdfContentByte cb = writer.DirectContent;
        foreach (FileInfo inputFile in inputFiles)
        {
            using (Bitmap bm = new Bitmap(inputFile.FullName))
            {
                int total = bm.GetFrameCount(FrameDimension.Page);
                for (int k = 0; k < total; ++k)
                {
                    bm.SelectActiveFrame(FrameDimension.Page, k);
                    // Testing shows that this line takes the lion's share (80%) of the time involved.
                    iTextSharp.text.Image img =
                        iTextSharp.text.Image.GetInstance(bm, null, true);
                    img.ScalePercent(72f / 200f * 100);
                    img.SetAbsolutePosition(0, 0);
                    cb.AddImage(img);
                    document.NewPage();
                }
            }
        }
        document.Close();
        writer.Close();
    }
}
Modify the GetInstance call to pass the image format explicitly:

GetInstance(bm, ImageFormat.Tiff)

This may improve performance:

iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(bm, ImageFormat.Tiff);
I'm not sure what was available when this question was originally posted, but iText 5.x appears to have more to offer when converting TIFF to PDF. There is also a basic code sample in iText in Action, 2nd Edition ("part3.chapter10.PagedImages"), and I haven't noticed any performance problems with it. However, the sample doesn't handle scaling well, so I changed it like this:
public static void AddTiff(Document pdfDocument, Rectangle pdfPageSize, String tiffPath)
{
    RandomAccessFileOrArray ra = new RandomAccessFileOrArray(tiffPath);
    int pageCount = TiffImage.GetNumberOfPages(ra);
    for (int i = 1; i <= pageCount; i++)
    {
        Image img = TiffImage.GetTiffImage(ra, i);
        if (img.ScaledWidth > pdfPageSize.Width || img.ScaledHeight > pdfPageSize.Height)
        {
            if (img.DpiX != 0 && img.DpiY != 0 && img.DpiX != img.DpiY)
            {
                // Non-square DPI: scale each axis independently to fit the page.
                img.ScalePercent(100f);
                float percentX = (pdfPageSize.Width * 100) / img.ScaledWidth;
                float percentY = (pdfPageSize.Height * 100) / img.ScaledHeight;
                img.ScalePercent(percentX, percentY);
                img.WidthPercentage = 0;
            }
            else
            {
                img.ScaleToFit(pdfPageSize.Width, pdfPageSize.Height);
            }
        }
        // Size each PDF page to match the scaled image.
        Rectangle pageRect = new Rectangle(0, 0, img.ScaledWidth, img.ScaledHeight);
        pdfDocument.SetPageSize(pageRect);
        pdfDocument.SetMargins(0, 0, 0, 0);
        pdfDocument.NewPage();
        pdfDocument.Add(img);
    }
}
The trouble is the length of time it takes iTextSharp to process your System.Drawing.Image object.
To speed this up (to roughly a tenth of a second in some tests I have run), save the selected frame out to a memory stream and pass the byte array of data directly to iTextSharp's GetInstance method; see below:
bm.SelectActiveFrame(FrameDimension.Page, k);
iTextSharp.text.Image img;
using (System.IO.MemoryStream mem = new System.IO.MemoryStream())
{
    // This skips the built-in processing iTextSharp would otherwise perform.
    // It will create a larger PDF, though.
    bm.Save(mem, System.Drawing.Imaging.ImageFormat.Png);
    img = iTextSharp.text.Image.GetInstance(mem.ToArray());
}
img.ScalePercent(72f / 200f * 100);
You're crunching quite a lot of data, so if the PDF export process is slow and you're not using a fast PC, you may be stuck with that sort of performance.
The most obvious way to speed this up on a multi-core system is to multi-thread it.
Break the code into two stages. First, convert the images and store them in a list; then output the list to the PDF. With the file sizes you're talking about, holding the entire document in memory during processing shouldn't be a problem.
You can then make the first stage multi-threaded: fire off a thread-pool thread for each image that needs to be converted, capping the number of active threads (roughly one per CPU core is enough; any more won't gain you much). An alternative is to split your list of inputs into n lists (again, one per CPU core) and fire off threads that each process their own list. This reduces the threading overhead, but some threads may finish long before others (if their workload turns out to be much smaller), so it may not always work out quite as fast.
By splitting the work into two passes you may also gain performance even without multithreading: doing all the input processing and then all the output processing as separate stages will probably reduce the disk seeking involved (depending on how much RAM you have available for disk caches on your PC).
Note that multithreading won't be of much use if you only have a single-core CPU (though you could still see gains in parts of the process that are I/O bound; it sounds like you're primarily CPU bound).
I had this exact problem. I ended up using Adobe Acrobat's Batch Processing feature, which worked well. I just set up a new batch process that converts all the TIFFs in a target folder to PDFs written to a destination folder and started it. It was easy to set up, though processing took longer than I liked. It did get the job done.
Unfortunately Adobe Acrobat is not free, but you should consider it (weigh the cost of your time developing a 'free' solution against the cost of the software).
// Testing shows that this line takes the lion's share (80%) of the time involved.
iTextSharp.text.Image img =
    iTextSharp.text.Image.GetInstance(bm, null, true);
This might be a stupid suggestion (I don't have a large test set to try it locally right now), but give me the benefit of the doubt:
You're looping through a multi-page TIFF here, selecting frame after frame. bm is the whole (huge, 6.5 MB) image, held in memory. I don't know enough about iTextSharp's internal image handling, but maybe you can help it by providing a single-page image instead. Can you try creating a new Bitmap of the desired size, drawing bm onto it (look at the Graphics object's properties related to speed, InterpolationMode for example), and passing in that single image instead of the huge thing on each call?
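That suggestion could be sketched like this (an untested assumption about what helps; CopyActiveFrame is a hypothetical helper, not a GDI+ or iTextSharp API):

```csharp
// Copy the currently selected frame of a multi-page TIFF into a fresh,
// single-page Bitmap so iTextSharp never sees the full multi-frame image.
private static Bitmap CopyActiveFrame(Bitmap source, int width, int height)
{
    var single = new Bitmap(width, height);
    using (Graphics g = Graphics.FromImage(single))
    {
        // Trade quality for speed; see also SmoothingMode and PixelOffsetMode.
        g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.NearestNeighbor;
        g.DrawImage(source, 0, 0, width, height);
    }
    return single;
}
```

Inside the loop you would call bm.SelectActiveFrame(...) as before, then pass CopyActiveFrame(bm, bm.Width, bm.Height) to GetInstance and dispose the copy afterwards.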
Based on your samples, I made a function that does both; a simple enum selects the working mode. Here it is:
private static void CombineAndConvertTif(FileInfo inputFile, FileInfo outputFile, QualityMode qualityMode)
{
    Encoder myEncoder = Encoder.Quality;
    EncoderParameters myEncoderParameters = new EncoderParameters(1);
    EncoderParameter myEncoderParameter = new EncoderParameter(myEncoder, 50L);
    myEncoderParameters.Param[0] = myEncoderParameter;
    ImageCodecInfo jpgEncoder = GetEncoder(ImageFormat.Jpeg);
    Console.Write("Converting {0} to {1}... ", inputFile.Name, outputFile.Name);
    Stopwatch sw = Stopwatch.StartNew();
    using (
        FileStream fs = new FileStream(
            outputFile.FullName, FileMode.Create, FileAccess.ReadWrite, FileShare.None))
    {
        Document document = new Document(PageSize.A4, 50, 50, 50, 50);
        PdfWriter writer = PdfWriter.GetInstance(document, fs);
        writer.CompressionLevel = PdfStream.BEST_COMPRESSION; // valid range is 0-9
        writer.SetFullCompression();
        document.Open();
        PdfContentByte cb = writer.DirectContent;
        using (Bitmap bm = new Bitmap(inputFile.FullName))
        {
            int pages = bm.GetFrameCount(FrameDimension.Page);
            for (int currentPage = 0; currentPage < pages; ++currentPage)
            {
                bm.SelectActiveFrame(FrameDimension.Page, currentPage);
                bm.SetResolution(96, 96);
                Image img;
                if (qualityMode == QualityMode.Slow)
                {
                    #region Low speed, smaller files
                    img = iTextSharp.text.Image.GetInstance(bm, null, true);
                    #endregion
                }
                else
                {
                    #region Fast speed, bigger files
                    using (MemoryStream mem = new MemoryStream())
                    {
                        bm.Save(mem, jpgEncoder, myEncoderParameters);
                        img = Image.GetInstance(mem.ToArray());
                    }
                    #endregion
                }
                img.ScalePercent(72f / 200f * 100);
                img.SetAbsolutePosition(0, 0);
                cb.AddImage(img);
                document.NewPage();
            }
        }
        document.Close();
        writer.Close();
    }
    sw.Stop();
    Console.WriteLine(" time: {0}", sw.Elapsed);
}

// Standard helper to look up the JPEG codec (from the EncoderParameter docs).
private static ImageCodecInfo GetEncoder(ImageFormat format)
{
    foreach (ImageCodecInfo codec in ImageCodecInfo.GetImageDecoders())
    {
        if (codec.FormatID == format.Guid)
        {
            return codec;
        }
    }
    return null;
}
And the enum is:
internal enum QualityMode
{
    /// <summary>
    /// Processes images quickly but
    /// produces bigger PDFs.
    /// </summary>
    Fast,

    /// <summary>
    /// Processes images more slowly but
    /// produces smaller PDFs.
    /// </summary>
    Slow
}