How to extract font styles of text contents using pdfbox?

Developer https://www.devze.com 2023-03-26 07:51 Source: Internet
I am using the pdfbox library to extract the text contents of a PDF file. I was able to extract all the text, but couldn't find a method to extract the font styles.


This is not the right way to extract fonts. To read the fonts, iterate over the PDF's pages and take the fonts from each page's resources, as below:

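// PDFBox 1.x API
PDDocument doc = PDDocument.load("C:/mydoc3.pdf");
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
    // Fonts declared in this page's resources, keyed by resource name
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
}
doc.close();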


import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class pdf2box {
    public static void main(String args[])
    {
        try
        {
            PDDocument pddDocument = PDDocument.load("table2.pdf");
            PDFTextStripper textStripper = new PDFTextStripper();
            // Extracting the plain text works
            System.out.println(textStripper.getText(pddDocument));
            // Attempt to get the fonts from the stripper; the returned value is never used
            textStripper.getFonts();
            pddDocument.close();
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
    }
}


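File file = new File("sample.pdf");
// PDFBox 2.x API; try-with-resources closes the document when done
try (PDDocument document = PDDocument.load(file))
{
    for (int i = 0; i < document.getNumberOfPages(); ++i)
    {
        PDPage page = document.getPage(i);
        PDResources res = page.getResources();
        if (res == null)
        {
            continue; // this page has no resources
        }
        for (COSName fontName : res.getFontNames())
        {
            // Look up each font declared in the page's resources and print its name
            PDFont font = res.getFont(fontName);
            System.out.println(font.getName());
        }
    }
}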
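
The names printed by font.getName() often carry the style themselves, for example a suffix such as -Bold or -Italic, and embedded subset fonts are prefixed with a six-letter tag plus "+" (as in ABCDEF+Arial-BoldMT). As a rough sketch that needs only the JDK, the helper below guesses a style from such a name. The class FontStyleGuess, its method names, and the sample font names are hypothetical and not part of pdfbox, and the keyword matching is a heuristic assumption: a PDF producer is free to name its fonts differently, so treat the result as a hint rather than a fact.

```java
import java.util.Locale;

// Hypothetical helper, not part of pdfbox: guesses a style from a font name string.
final class FontStyleGuess {

    // Removes a subset tag such as "ABCDEF+" (six uppercase letters followed by '+').
    static String stripSubsetTag(String fontName) {
        if (fontName == null) {
            return "";
        }
        if (fontName.matches("[A-Z]{6}\\+.*")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic: looks for common style keywords in the lower-cased name.
    static String guessStyle(String fontName) {
        String name = stripSubsetTag(fontName).toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) {
            return "bold italic";
        }
        if (bold) {
            return "bold";
        }
        if (italic) {
            return "italic";
        }
        return "regular";
    }

    public static void main(String[] args) {
        // Sample names for illustration only; in practice pass in font.getName().
        String[] samples = {
            "ABCDEF+Arial-BoldMT", "Times-BoldItalic", "Helvetica-Oblique", "Courier"
        };
        for (String sample : samples) {
            System.out.println(stripSubsetTag(sample) + " -> " + guessStyle(sample));
        }
    }
}
```

In the loop above, this would be called as FontStyleGuess.guessStyle(font.getName()). A name without any of these keywords is reported as "regular", which only means that the name gave no hint.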