Extracting text from PDF document - C# [duplicate]

Developer (https://www.devze.com), 2022-12-20 13:06. Source: web
This question already has answers here: Extracting text from PDFs in C# [closed] (6 answers) Closed 3 years ago.

Is there a reliable way to extract text from a PDF? My first concern is that a PDF may have multiple columns, so the extraction mechanism needs to know the logical structure somehow. I understand that some PDF documents are "tagged", but I need to support pretty much any PDF document.

Are there any third-party components that can help here?


Please see: Extracting text from PDFs in C#


Some PDFs are scans, so OCR would be required (not easy, to say the least).

Some PDFs are compressed; others (more rarely) are bare, uncompressed PDFs.
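
To make the compression point concrete, here is a minimal sketch. It assumes "compressed" means content streams encoded with the FlateDecode filter (zlib format), which is the common case but is not stated above. `FlateSketch`, `Inflate`, and `Deflate` are hypothetical names for illustration; locating the stream bytes inside the file, predictors, and other filters are not handled. `ZLibStream` requires .NET 6 or later.

```csharp
using System.IO;
using System.IO.Compression;

public static class FlateSketch
{
    // Inflates a zlib-wrapped buffer, the format assumed here for FlateDecode streams.
    public static byte[] Inflate(byte[] zlibData)
    {
        using var input = new MemoryStream(zlibData);
        using var zlib = new ZLibStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        zlib.CopyTo(output);
        return output.ToArray();
    }

    // Helper for trying the sketch without a real PDF: compresses bytes the same way.
    public static byte[] Deflate(byte[] raw)
    {
        using var output = new MemoryStream();
        using (var zlib = new ZLibStream(output, CompressionMode.Compress))
        {
            zlib.Write(raw, 0, raw.Length);
        }
        // ToArray still works after the wrapping stream has closed the MemoryStream.
        return output.ToArray();
    }
}
```

Even after inflating, the result is a sequence of drawing operators rather than plain text, so a PDF library is still needed to interpret it.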

The PDF file format itself is well documented, but extracting the right "structure" from anything but a simple one-column document is a tall order. Internally, a PDF resembles an HTML page in which every line of text sits in its own DIV with absolute positioning.
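
The absolute-positioning problem can be sketched as follows. This is an illustration only, not tied to any particular PDF library: `TextFragment` and `LineGrouper` are hypothetical names, and the fragments are assumed to come from whatever extraction component you use, with PDF-style coordinates where Y grows upward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one positioned run of text, as a PDF library might report it.
public record TextFragment(double X, double Y, string Text);

public static class LineGrouper
{
    // Groups fragments whose Y values lie within yTolerance of the first
    // fragment on the current line, then orders each line left to right.
    public static List<string> ToLines(IEnumerable<TextFragment> fragments, double yTolerance = 2.0)
    {
        var lines = new List<List<TextFragment>>();
        // Higher Y means higher on the page, so read from the top down.
        foreach (var f in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - f.Y) <= yTolerance)
                current.Add(f);
            else
                lines.Add(new List<TextFragment> { f });
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

For example, fragments `("world", X=60, Y=700)`, `("Hello", X=10, Y=700.5)`, and `("Second", X=10, Y=680)` come back as the two lines "Hello world" and "Second". Note that this simple grouping would merge two columns that share a baseline into one line, which is exactly why multi-column documents are hard.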
