Identify hidden text Word 2003/2007 using Apache POI

Developer https://www.devze.com 2023-03-30 05:04 Source: Internet

I am converting Word (2003 and 2007) documents to HTML format. I have managed to read the text, formatting, etc. from the Word document. However, the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?

Any help would be much appreciated.

I am not sure if this is a complete (or even accurate) solution, but for files in the DOCX format it seems that you can check whether a character run is hidden with:

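XWPFRun cr; // a run taken from XWPFParagraph.getRuns()
// CTRPr is org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr
CTRPr rPr = cr.getCTR().getRPr(); // null when the run has no explicit properties
if (rPr != null && rPr.getVanish() != null) {
   // it is hidden
}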

I got this from reverse-engineering the XML, and at least in my usage it seems to work. I would be very glad for additional (more informed) input, and for a way to do the same thing in the old binary file format.
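To make the XML side of this concrete, here is a minimal sketch that uses only the JDK's DOM parser (no POI) to apply the same rule to a single `<w:r>` element. The class name `VanishCheck` and its helper methods are hypothetical names chosen for illustration. It assumes that hidden text is marked by a `<w:vanish/>` child of the run's `<w:rPr>`, and that a `w:val` of `false`, `0` or `off` switches the flag off again, which a plain "element is present" check would miss. It only looks at direct run formatting, so text hidden through a paragraph or character style would not be detected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Apache POI.
class VanishCheck {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the first direct child element with the given local name, or null.
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // True if the <w:r> element has direct hidden-text formatting.
    static boolean isRunHidden(Element run) {
        Element rPr = child(run, "rPr");
        if (rPr == null) {
            return false; // no explicit run properties at all
        }
        Element vanish = child(rPr, "vanish");
        if (vanish == null) {
            return false; // properties present, but no vanish flag
        }
        // getAttributeNS returns "" when w:val is absent, which means "on".
        String val = vanish.getAttributeNS(W_NS, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    // Parses an XML fragment whose root element is a <w:r> run.
    static Element parseRun(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getLocalName()
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse run XML", e);
        }
    }

    public static void main(String[] args) {
        String ns = " xmlns:w=\"" + W_NS + "\"";
        // A run with a bare <w:vanish/> flag
        String hidden = "<w:r" + ns + "><w:rPr><w:vanish/></w:rPr>"
                + "<w:t>Header Change History</w:t></w:r>";
        // A run where the flag is explicitly switched off
        String shown = "<w:r" + ns + "><w:rPr><w:vanish w:val=\"false\"/></w:rPr>"
                + "<w:t>Visible text</w:t></w:r>";
        System.out.println(isRunHidden(parseRun(hidden)));
        System.out.println(isRunHidden(parseRun(shown)));
    }
}
```

In a real converter the `<w:r>` elements would come from `word/document.xml` inside the .docx package rather than from string literals; the strings here just keep the sketch self-contained.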

The following code snippet helps identify whether text is hidden in the old binary (.doc) format, using HWPF:

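    // Imports needed: java.io.FileInputStream,
    // org.apache.poi.poifs.filesystem.POIFSFileSystem, org.apache.poi.hwpf.HWPFDocument,
    // org.apache.poi.hwpf.extractor.WordExtractor, org.apache.poi.hwpf.usermodel.Range
    POIFSFileSystem fs = null;
    boolean isHidden = false;
    try {
        // filesname holds the path to the .doc file
        fs = new POIFSFileSystem(new FileInputStream(filesname));
        HWPFDocument doc = new HWPFDocument(fs);
        WordExtractor we = new WordExtractor(doc);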

        String[] paragraphs = we.getParagraphText();

        System.out.println("Word Document has " + paragraphs.length
                + " paragraphs");
        Range range = doc.getRange();

        for (int k = 0; k < range.numParagraphs(); k++) {

            org.apache.poi.hwpf.usermodel.Paragraph paragraph = range
                    .getParagraph(k);
            // Strings are immutable, so the cleaned-up text has to be assigned to be kept
            String text = paragraph.text().trim().replaceAll("\\cM?\r?\n", "");

            for (int j = 0; j < paragraph.numCharacterRuns(); j++) {

                org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph
                        .getCharacterRun(j);

                if (cr.isVanished()) {
                    // it is hidden
                    System.out.println("text is hidden ");
                    isHidden = true;
                    break; // exits only the inner loop; later paragraphs are still scanned
                }

            }