
iTextSharp taking too much time in getting Number of Pages

I have this piece of code:

foreach(string pdfFile in Directory.EnumerateFiles(selectedFolderMulti_txt.Text,"*.pdf",SearchOption.AllDirectories))
{
    //filePath = pdfFile.FullName;
    //string abc = Path.GetFileName(pdfFile);
    try
    {
        //pdfReader = new iTextSharp.text.pdf.PdfReader(filePath);
        pdfReader = new iTextSharp.text.pdf.PdfReader(pdfFile);
        rownum = pdfListMulti_gridview.Rows.Add();
        pdfListMulti_gridview.Rows[rownum].Cells[0].Value = counter++;
        //pdfListMulti_gridview.Rows[rownum].Cells[1].Value = pdfFile.Name;
        pdfListMulti_gridview.Rows[rownum].Cells[1].Value = System.IO.Path.GetFileName(pdfFile);
        pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
        //pdfListMulti_gridview.Rows[rownum].Cells[3].Value = filePath;
        pdfListMulti_gridview.Rows[rownum].Cells[3].Value = pdfFile;
        //totalpages += pdfReader.NumberOfPages;
    }
    catch
    {
        //MessageBox.Show("There was an error while opening '" + pdfFile.Name + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
        MessageBox.Show("There was an error while opening '" + System.IO.Path.GetFileName(pdfFile) + "'", "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}

The problem is that when I pointed it at a folder containing about 4,000 PDF files today, it took about 20 minutes to read all the files and show the results. That made me wonder what this code will do when I give it a folder with more than 20,000 files.

If I comment out this line:

pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;

Then it seems as if all of the processing burden is removed from the code.

So what I want from you guys is a suggestion for making my approach more efficient, so that processing all the files takes less time. Or is there an alternative?


Definitely do what @ChrisBint said; that will get past Windows' slowness with folders containing many files.

But to get even more speed, make sure to use the overload of PdfReader that takes a RandomAccessFileOrArray object instead. This object is way faster than regular streams in all of my testing. The constructor has a couple of overloads, but you should mainly concern yourself with RandomAccessFileOrArray(string filename, bool forceRead). The second parameter controls whether or not to load the entire file into memory (if I'm understanding the documentation correctly). For very large files this might be a performance hit, but on modern machines it shouldn't matter much, so I recommend that you pass true. If you pass false, the disk will need to be hit several times as the parsing "cursor" walks through the file.

So with all of that you can do this in a very tight loop. For me, 4,000 files containing a total of over 42,000 pages takes about 2 seconds to run.

var files = Directory.EnumerateFiles(workingFolder, "*.pdf");
int totalPageCount = 0;
foreach (string f in files)
{
    // forceRead = true loads the whole file into memory, so the parser
    // doesn't have to seek on disk while walking the PDF structure.
    totalPageCount += new PdfReader(new RandomAccessFileOrArray(f, true), null).NumberOfPages;
}
MessageBox.Show(String.Format("Total Page Count : {0:N0}", totalPageCount));
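
If you want to keep populating the grid as in the question, a rough sketch along these lines shows how the RandomAccessFileOrArray overload slots into the original loop; it reuses the question's pdfListMulti_gridview, pdfReader, rownum and counter names, so treat it as an untested adaptation rather than a drop-in replacement:

foreach (string pdfFile in Directory.EnumerateFiles(selectedFolderMulti_txt.Text, "*.pdf", SearchOption.AllDirectories))
{
    try
    {
        // Load the whole file into memory (forceRead = true) so counting
        // pages doesn't hit the disk repeatedly for each PDF.
        pdfReader = new iTextSharp.text.pdf.PdfReader(
            new iTextSharp.text.pdf.RandomAccessFileOrArray(pdfFile, true), null);

        rownum = pdfListMulti_gridview.Rows.Add();
        pdfListMulti_gridview.Rows[rownum].Cells[0].Value = counter++;
        pdfListMulti_gridview.Rows[rownum].Cells[1].Value = System.IO.Path.GetFileName(pdfFile);
        pdfListMulti_gridview.Rows[rownum].Cells[2].Value = pdfReader.NumberOfPages;
        pdfListMulti_gridview.Rows[rownum].Cells[3].Value = pdfFile;

        pdfReader.Close(); // release the buffered file data before the next iteration
    }
    catch
    {
        MessageBox.Show("There was an error while opening '" + System.IO.Path.GetFileName(pdfFile) + "'",
            "Error!", MessageBoxButtons.OK, MessageBoxIcon.Error);
    }
}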


Personally, I would change your code slightly so that Directory.EnumerateFiles is not called inside the foreach statement itself. For example:

var listOfFiles = Directory.EnumerateFiles(selectedFolderMulti_txt.Text, "*.pdf", SearchOption.AllDirectories);
foreach (string pdfFile in listOfFiles)
{
    //Do something
}

I doubt this would impact the overall time by a massive amount, if any.

As for the speed of the call to the NumberOfPages property: it is unlikely that you will be able to optimise this, as the work is internal to the PdfReader object. If performance is a concern, then this may require additional hardware.

Personally, I would not treat this as an issue unless I had to run the scan continually (in which case I would start looking at caching: checking for existing files and only adding those that are new or have changed).
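
To illustrate that caching idea, here is a hypothetical sketch (the pageCache field and GetPageCount helper are made up for illustration, and assume System.Collections.Generic and System.IO are available): store each file's page count together with its last-write time, and only re-open PDFs that are new or have changed.

// Hypothetical caching helper: keeps page counts keyed by file path and
// only re-parses a PDF when its last-write time has changed.
Dictionary<string, Tuple<DateTime, int>> pageCache = new Dictionary<string, Tuple<DateTime, int>>();

int GetPageCount(string pdfFile)
{
    DateTime lastWrite = System.IO.File.GetLastWriteTimeUtc(pdfFile);

    Tuple<DateTime, int> cached;
    if (pageCache.TryGetValue(pdfFile, out cached) && cached.Item1 == lastWrite)
        return cached.Item2; // unchanged file: reuse the stored count

    int pages = new iTextSharp.text.pdf.PdfReader(
        new iTextSharp.text.pdf.RandomAccessFileOrArray(pdfFile, true), null).NumberOfPages;

    pageCache[pdfFile] = Tuple.Create(lastWrite, pages);
    return pages;
}

Persisting pageCache to disk between runs (for example as a small serialized file) would let repeated scans of the same folder skip almost all of the PDF parsing.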

