Reading a PDF document with iText not working sometimes

Developer https://www.devze.com 2022-12-12 06:42 Source: Internet
I am using iText to read from a PDF doc. I am getting an ArrayIndexOutOfBoundsException. The strange thing is it only happens for certain files and at certain locations in those files. I suspect it's something to do with the way the PDF is encoded at those locations but can't figure out what the problem is.

I have looked at the question "Read pdf using iText", but its author seems to have solved the problem by changing the location of the file. That is not going to work for me, as I get the exception at certain locations within some files - so it's not the file itself but the page in question that is causing the exception.

The stack trace is

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)

And line 61 corresponds to this line:

content = extractor.getTextFromPage(page);

So it seems clear that the exception is thrown from inside the getTextFromPage() call.

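import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class Extractor {

    public static void main(String[] args) throws IOException {
        // Keywords to search for in the extracted text (not used yet).
        ArrayList<String> keywords = new ArrayList<String>();
        keywords.add("location");
        keywords.add("Mass Spectrometry");
        keywords.add("vacuole");
        keywords.add("cytosol");

        // Every file in this directory is opened as a PDF.
        String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
        File directoryToRead = new File(directory);
        String[] sa_filesToRead = directoryToRead.list();
        List<String> filesToRead = Arrays.asList(sa_filesToRead);

        Iterator<String> fileItr = filesToRead.iterator();
        while (fileItr.hasNext()) {
            String nextFile = fileItr.next();

            PdfReader reader = new PdfReader(directory + nextFile);
            int noPages = reader.getNumberOfPages();
            PdfTextExtractor extractor = new PdfTextExtractor(reader);

            String content = "";
            for (int page = 1; page <= noPages; page++) {
                int index = 1;
                System.out.println(page);
                // The exception is thrown here, but only for some pages.
                content = extractor.getTextFromPage(page);
            }
        }
    }
}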


Most Java classes/libraries expect that a method like getTextFromPage(int) is indexed starting at 0 - meaning that getTextFromPage(0) should return the text from page 1 and getTextFromPage(1) should return the text from page 2.

Your for loop that causes the ArrayIndexOutOfBoundsException is indexed starting with 1.

Are you sure that iText's getTextFromPage(int) is indexed starting at 1 rather than the (almost) standard 0?


Have you tried posting on the very active iText mailing list?


I have a similar problem, and it always occurs where the text contains special characters. I wonder if there is a way to work around the encoding.

(Update) I had this problem with com.itextpdf.itextpdf 5.1.3, but after updating to 5.3.4 the problem was fixed.
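
If upgrading is not an option, one possible stopgap is to catch the exception per page, so that a single page that fails to decode does not abort the whole run. The sketch below is only an illustration and does not repair the encoding problem itself. `PageTextSource`, `Result`, and `extractAll` are hypothetical names introduced here, not part of iText. The interface stands in for `PdfTextExtractor.getTextFromPage(int)` so that the example runs without the library; with iText on the classpath, one would presumably pass `extractor::getTextFromPage` instead of the fake source.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TolerantExtraction {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int). */
    public interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /** Text of the pages that decoded, plus the numbers of the pages that failed. */
    public static final class Result {
        public final Map<Integer, String> textByPage = new LinkedHashMap<>();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    /** Extracts pages 1..noPages, recording failures instead of aborting. */
    public static Result extractAll(PageTextSource source, int noPages) {
        Result result = new Result();
        for (int page = 1; page <= noPages; page++) {
            try {
                result.textByPage.put(page, source.getTextFromPage(page));
            } catch (IOException | RuntimeException e) {
                // Covers the ArrayIndexOutOfBoundsException from the question.
                result.failedPages.add(page);
                System.err.println("Skipping page " + page + ": " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake source that fails on page 2, imitating the reported behaviour.
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        };

        Result result = extractAll(fake, 3);
        System.out.println("Extracted pages: " + result.textByPage.keySet());
        System.out.println("Failed pages: " + result.failedPages);
    }
}
```

The list of failed pages also helps with diagnosis: if the first and last pages extract fine and only some pages in between fail, the cause is more likely the content of those pages than the loop bounds.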
