Extracting paragraphs from a PDF

Developer https://www.devze.com 2023-02-16 17:50 Source: Internet

I'm doing topic modelling on a PDF e-book and need to extract its text paragraph by paragraph. For this I use Apache PDFBox, which extracts text from PDFs efficiently.

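PDFParser parser;  // produces pdDoc (loading code not shown)
PDFTextStripper pdfStrip = new PDFTextStripper();  // must be instantiated before use
String parsedText = pdfStrip.getText(pdDoc);  // text of the whole document as one string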

But I cannot extract the paragraphs separately. PDFBox provides a way to set the paragraph start/end markers, but to make use of that I need to know how a paragraph break is identified.

Is there a way to do this, or is there some other tool available that can extract paragraphs effectively?
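For what it's worth, here is a minimal sketch of how the paragraph end marker might be used. As far as I know, PDFTextStripper guesses paragraph boundaries heuristically from line spacing and indentation, so you do not supply the break identifier yourself; you only choose the string it writes at each boundary (via setParagraphEnd) and then split on it. The class name ParagraphSplitter and the marker value are made up for illustration, and the PDFBox calls appear only in a comment so that the snippet runs without the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphSplitter {
    // Hypothetical marker: any string unlikely to occur in the book's text works.
    static final String PARA_END = "<<PARA_END>>";

    // Splits text in which each paragraph is terminated by PARA_END.
    static List<String> split(String markedText) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : markedText.split(Pattern.quote(PARA_END))) {
            // Collapse the line breaks inside a paragraph into single spaces.
            String p = chunk.replaceAll("\\s+", " ").trim();
            if (!p.isEmpty()) {
                paragraphs.add(p);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // With PDFBox on the classpath, the marked text would come from something like:
        //   PDFTextStripper stripper = new PDFTextStripper();
        //   stripper.setParagraphEnd(PARA_END);
        //   String markedText = stripper.getText(pdDoc);
        String markedText = "First line\nstill first." + PARA_END
                + "Second paragraph." + PARA_END;
        split(markedText).forEach(System.out::println);
    }
}
```

How well this works depends on the layout of the e-book; if paragraphs are detected poorly, the stripper's setIndentThreshold and setDropThreshold settings are the ones I would try adjusting first.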

PdfNitro is the best tool I have found for extracting paragraphs.

The only problem with this tool is that it treats a page break as a paragraph break; otherwise it works well. A 14-day trial version is available for testing.
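If the page-break issue matters for your topic model, one possible workaround is to post-process the output. The sketch below is only a heuristic of my own, not a feature of PdfNitro or PDFBox: it assumes you already have the paragraphs grouped per page, and it joins the first paragraph of a page onto the last paragraph of the previous page when that one does not end in sentence-final punctuation. The names PageBreakMerger, looksUnfinished and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PageBreakMerger {
    // Heuristic: a paragraph that does not end in sentence-final punctuation
    // is assumed to continue on the next page.
    static boolean looksUnfinished(String paragraph) {
        String p = paragraph.trim();
        if (p.isEmpty()) {
            return false;
        }
        return ".!?:\")".indexOf(p.charAt(p.length() - 1)) < 0;
    }

    // pages: one list of paragraphs per page, in reading order.
    static List<String> merge(List<List<String>> pages) {
        List<String> out = new ArrayList<>();
        for (List<String> page : pages) {
            for (int i = 0; i < page.size(); i++) {
                int last = out.size() - 1;
                boolean firstOnPage = (i == 0);
                if (firstOnPage && last >= 0 && looksUnfinished(out.get(last))) {
                    // Join across the page break instead of starting a new paragraph.
                    out.set(last, out.get(last).trim() + " " + page.get(i).trim());
                } else {
                    out.add(page.get(i).trim());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> pages = List.of(
                List.of("Intro paragraph.", "This sentence runs over the"),
                List.of("page break and ends here.", "A new paragraph."));
        merge(pages).forEach(System.out::println);
    }
}
```

This will misfire on pages that start with a heading or a running header, and it does not rejoin words hyphenated across the break, so treat it as a starting point.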