How to extract paragraphs instead of the whole text with XWPFWordExtractor (Apache POI library, Java)

Developer https://www.devze.com 2022-12-25 16:18 Source: Internet
I know the following code extracts the whole text of a docx document; however, I need to extract individual paragraphs instead. Is there a way to do that?

public static String extractText(InputStream in) throws Exception {

    JOptionPane.showMessageDialog(null, "Start extracting docx");
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    String text = ex.getText();
    return text;
}

Any help would be much appreciated. I need this urgently.


This is just a guess after a brief look at the API:

doc.getParagraphs()

Link to the API: http://poi.apache.org/apidocs/org/apache/poi/xwpf/usermodel/XWPFDocument.html#getParagraphs()


I wrote a utility method for this, as below:

// Needs java.io.*, java.util.*, org.apache.poi.openxml4j.opc.OPCPackage
// and org.apache.poi.xwpf.usermodel.*
public static List<String> getParagraphs(File file)
{
    List<String> paragraphs = new ArrayList<>();

    // try-with-resources closes the stream and the document, even on failure
    try (FileInputStream fis = new FileInputStream(file);
         XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis)))
    {
        // One list entry per paragraph of the document body
        for (XWPFParagraph paragraph : xdoc.getParagraphs())
        {
            paragraphs.add(paragraph.getText());
        }
    }
    catch (Exception ex)
    {
        ex.printStackTrace();
    }
    return paragraphs;
}
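
One thing to watch for with a list built this way: Word stores blank lines as empty paragraphs, so the result may contain empty strings. Below is a minimal sketch of a follow-up filter, assuming you only want paragraphs with visible text. The names ParagraphFilter and nonBlank are made up for illustration and are not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class ParagraphFilter {

    // Returns a new list without paragraphs that are empty or whitespace-only.
    // The remaining paragraph texts are kept unchanged and in their original order.
    static List<String> nonBlank(List<String> paragraphs) {
        List<String> result = new ArrayList<>();
        for (String text : paragraphs) {
            if (text != null && !text.trim().isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }
}
```

You would call it on the list returned by the utility method, e.g. ParagraphFilter.nonBlank(getParagraphs(file)).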


Although the question is very old, I am answering in the hope that it helps anyone who lands here looking for an answer.

// fis is an InputStream opened on the .docx file
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();

for (XWPFParagraph paragraph : paragraphs) {
    System.out.println("Text in this paragraph: " + paragraph.getText());
}
System.out.println("Total number of paragraphs in docx: " + paragraphs.size());

Hope this helps!
