I'm trying to parse a bunch of PDFs that have a section of what appears to be text but is really just a set of embedded shapes drawn to look like text, so extracting that 'text' with the normal PdfTextExtractor object in iTextSharp is not possible.
Since the text I am trying to extract is one of only 10 possible words, instead of actually 'reading' the word (or rather, 'shapes in the form of a word'), I figured I can determine what the word is by comparing it against others that I have already identified.
My first question is: how do I even get to this section of the PDF? How would I use iText to parse the document and drill down to this shape object? There is a common word that begins this section in all my documents, so I thought I could use it as a landmark to know when I'm in the right area, but how do I even iterate through all the shapes in the document?
Then, once I find it, how do I identify the particular shapes (line segments?) of the other words to determine what letters I'm looking at?
To illustrate the problem, here's a comparable scenario - The section I need to parse is a map legend, and it will be an area of the PDF that looks like this:
-- LEGEND --
- road
- highway
- river
If I find the shape representing the word 'LEGEND' I know I'm in the right area, and then I can try determining what words are in the legend (since it's a limited list of around 10 words). But how do I do that?
I'm using .NET, so any C# or VB.Net code samples should work for me.
You have my pity.
The only reasonable way to handle this sort of thing is through OCR (optical character recognition). There's at least one decent open source OCR package to be found on Google Code.
The PDF parser package doesn't handle line art in any way yet, so that's out unless you want to write the support yourself.
Once you have "known good" examples of each of your 10 words, you MIGHT be able to come up with a RegEx that will detect each one consistently. This will fail unless your "text" is always in the same "font".
You'll have to look for specific series of lineTo/curveTo/moveTo commands.
You'll have to ignore the coordinates in your RegEx, but then go back and parse them if you need to determine a bounding box for the given word.
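To make that concrete, here is a rough sketch of the idea, written in Python for brevity (nothing in it depends on a library, so it should port to C# fairly directly). It assumes you have already got hold of the decoded content stream as a string and that the fragment you feed it contains only path drawing, with no text strings or inline images. It relies on the standard PDF path operators: m (moveTo), l (lineTo), c/v/y (curveTo variants), h (close path) and re (rectangle). The names parse_path_ops, signature, bbox, classify and KNOWN_WORDS are all made up for illustration, and the pattern for "road" is a placeholder, not a real glyph outline.

```python
import re

# Standard PDF path-construction operators and how many numeric operands each takes.
PATH_OPS = {"m": 2, "l": 2, "c": 6, "v": 4, "y": 4, "h": 0, "re": 4}

# Numbers, or runs of letters (operators). Assumes no strings or names in the fragment.
TOKEN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)|[A-Za-z*]+")


def parse_path_ops(content):
    """Return [(operator, [operands])] for every path operator in the fragment."""
    ops, operands = [], []
    for tok in TOKEN.findall(content):
        if tok[0] in "+-.0123456789":
            operands.append(float(tok))
            continue
        if tok in PATH_OPS:
            n = PATH_OPS[tok]
            ops.append((tok, operands[-n:] if n else []))
        operands = []  # any operator (path or not) consumes the pending operands
    return ops


def signature(ops):
    """Operator sequence with the coordinates dropped, e.g. 'm l l h'."""
    return " ".join(op for op, _ in ops)


def bbox(ops):
    """Approximate bounding box (min_x, min_y, max_x, max_y) in untransformed
    user space. Curve control points are included, so the box may be slightly
    larger than the drawn shape. Any 'cm' transforms are ignored here."""
    xs, ys = [], []
    for op, args in ops:
        if op == "re":
            x, y, w, h = args
            xs += [x, x + w]
            ys += [y, y + h]
        else:
            xs += args[0::2]
            ys += args[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical signatures; you would build these from your "known good" samples.
KNOWN_WORDS = {
    "road": re.compile(r"^m l l h m c$"),
}


def classify(sig):
    """Return the first known word whose pattern matches the signature, else None."""
    for word, pattern in KNOWN_WORDS.items():
        if pattern.search(sig):
            return word
    return None
```

You would run signature() over each of your known samples once, turn the results into patterns, and then use classify() on the shapes that follow the landmark word, falling back to bbox() when you need to know where the word sits on the page. As noted above, this only holds up if every document draws the word with the same outlines.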
Fun fun fun.