I am having a PDF with tables in it. The main objective is to have the similar table structure being reflected in ExcelSheet.
Reading the PDF stream with iTextSharp or PDFSharp I could get the plain text with loosing the structure of the table as in plain text the stream which previously had the coordin开发者_JAVA百科ate values for the text elements are being stripped out.
How can I deal with the stream using the coordinates to place my text values in exact positions in excel
I had the same problem of importing tabular parts of a PDF into Excel. I did the following way:
- manually open PDF, select all and copy
- manually change to Excel
- start a VBA which reads the clipboard, parses the data and writes out to the sheet
problem here was that the data in the buffer is not arranged horizontally - as you would expect - but vertically. So I had to develop some code around this as well. I used a class module to impelment functions like "next word", "next line", "search for word" etc.
I am happy to share this code if it helps.
EDIT:
I make use of a MSForms.DataObject
to read the Clipboard. After having created a reference to Microsoft Forms 2.0 Object library (...\system32\FM20.DLL), create a new Class Module named ClipClass
and put the following code in:
Public P As Integer ' line pointer
Public T As String ' total text buffer
Public L As String ' current line
Public Property Get FirstLine() As String
P = 1
FirstLine = NextLine()
End Property
Public Property Get NextLine() As String
L = ""
Do Until Mid(T, P, 2) = vbCrLf
L = L & Mid(T, P, 1)
P = P + 1
Loop
NextLine = L
P = P + 2
End Property
Public Property Get FindLine(Arg As String) As String
Dim Tmp As String
Tmp = FirstLine()
Do Until Tmp = Arg
Tmp = NextLine()
Loop
FindLine = Tmp
End Property
Private Sub Class_Initialize()
Dim Buf As MSForms.DataObject
Set Buf = New MSForms.DataObject ' this object interfaces with the clipboard
Buf.GetFromClipboard ' copy Clipboard to Object
T = Buf.GetText ' copy text from Object to string var
L = ""
P = 1
Set Buf = Nothing ' clean up
End Sub
This gives you all the functions you need to find a string and read out lines. Now for the fun part .... in my case I have a constant string in the PDF which always is situated 3 lines above the first table cell; and all table cells are arranged col by col in the text buffer. This is the Parser which is called by a button on the Excel sheet
Sub Parse()
Dim C As ClipClass, Tmp As String, WS As Range
Dim WSRow As Integer, WSCol As Integer
' initialize
Set WS = Worksheets("Table").[A1]
Set C = New ClipClass ' this creates the class instance and implicitely
' fires its Initialize() code which grabs the Clipboard
' get to head of table
Tmp = C.FindLine("identifying string before table starts")
' advance to one line before first table field - each field is terminated by CRLF
Tmp = C.NextLine
Tmp = C.NextLine
' PDF table is 3 col's x 7 rows organized col by col
For WSCol = 1 To 3
For WSRow = 1 To 7
WS(WSRow, WSCol) = C.NextLine
Next WSRow
Next WSCol
End Sub
In order to achieve the same first the PDF was read using iTextSharp (also tried with PDFCLown). The individual chunks with its coordinates were fetched from the PDF. As the PDF was following a similar pattern which was an Invoice file, logically the data were fetched accordingly and then with the help of NPOI the resulting excel format was achieved.
精彩评论