Is there an option to extract text from a PDF document with the iTextSharp library and retain formatting, e.g. the new line and tab characters?
When extracting text, the tab characters will come out, assuming that they actually are tab characters. I don't believe that new line characters can be determined without manually keeping track of the current text coordinates. You might be able to count the number of Td tokens between BT and ET and subtract 1, but that's just a guess.
EDIT
Never mind on the token thing. I thought that was used only for line readjustment (new line), but I was wrong.
I suggest you write your own TextExtractionStrategy based on LocationTextExtractionStrategy.
You'll need to track where the baselines are to determine newlines.
Actually, LocationTextExtractionStrategy just might add the newlines for you. Either way, that's where you need to start.
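To make the baseline idea concrete, here is a minimal, library-independent sketch in Python. It is not iTextSharp code: the Chunk class, the join_chunks function, and the tolerance value are hypothetical names chosen for illustration, and it assumes you have already collected each text fragment together with its start coordinates.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical text fragment with the start position of its baseline."""
    text: str
    x: float
    y: float  # baseline height; in PDF user space y grows upward


def join_chunks(chunks, tolerance=1.0):
    """Join chunks into text, starting a new line when the baseline changes.

    tolerance is an assumed threshold (in PDF units) below which two
    baselines are treated as the same line.
    """
    lines = []       # each entry is a list of chunks sharing one baseline
    baseline = None
    # Walk the page top to bottom (highest y first).
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if baseline is None or abs(chunk.y - baseline) > tolerance:
            lines.append([])
            baseline = chunk.y
        lines[-1].append(chunk)
    # Within a line, order fragments left to right and separate with a space.
    return "\n".join(
        " ".join(c.text for c in sorted(line, key=lambda c: c.x))
        for line in lines
    )


chunks = [
    Chunk("world", 60, 700.0),
    Chunk("Hello", 10, 700.2),
    Chunk("Second", 10, 686.0),
]
print(join_chunks(chunks))
```

The sketch ignores rotated text and decides nothing about tabs versus spaces; it only shows how a change in baseline can be turned into a new line character.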
It turns out the "\r\n" formatting is indeed retained. I verified this by fetching the value from the SQL Server table programmatically and calling Console.WriteLine(). Initially I was copying the value directly from SQL Server Management Studio and pasting it into a text file, which surely isn't the right way to verify.
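One way to check for line breaks without relying on how a tool displays or copies the value is to make the control characters visible before printing. The following is a small Python sketch of that idea; the visible function and the stored sample string are hypothetical stand-ins for the value read back from the table.

```python
def visible(text: str) -> str:
    """Return text with CR, LF and tab replaced by their escape sequences."""
    return (
        text.replace("\r", "\\r")
            .replace("\n", "\\n")
            .replace("\t", "\\t")
    )


# Stand-in for the extracted text fetched back from storage.
stored = "Line one\r\nLine two\tindented"
print(visible(stored))  # prints: Line one\r\nLine two\tindented
```

If the escape sequences show up in the output, the characters survived extraction and storage, whatever a copy and paste from a results window may suggest.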