I am extracting text from OCRed TIFF files using a library and dumping it into a database. The documents are actually forms with fields like NAME, DOB, COUNTRY, etc. Since the OCR does not know the difference between a label and its actual value, it just dumps all the text. The text in the DB now has the following format:
Name: MyName Address: My Address
etc
The next step is to extract values like MyName and My Address from the DB. The document types may vary, so a single generic parser might not work.
What would you suggest for dealing with this situation? Should I write a different parser for each document type? Could ANTLR help me? If so, how? Kindly guide me.
I am working in .NET.
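To make the extraction described above concrete, here is a minimal sketch of one possible approach, assuming the set of labels for each document type is known in advance. It is written in Python only for brevity; the same idea (a regular expression built from the label list, with a lookahead for the next label) should carry over to `System.Text.RegularExpressions` in .NET. The function name `extract_fields` and the label list `FORM_A_LABELS` are hypothetical, introduced purely for illustration.

```python
import re


def extract_fields(text, labels):
    """Split flat OCR text into {label: value} using a known list of labels."""
    # Try longer labels first so a long label is not shadowed by a shorter
    # one that happens to be its prefix.
    alternation = "|".join(
        re.escape(label) for label in sorted(labels, key=len, reverse=True)
    )
    # A known label followed by a colon; the value then runs lazily until
    # the next "known label + colon" or the end of the text.
    pattern = re.compile(
        r"\b(" + alternation + r")\s*:\s*(.*?)"
        r"(?=\s*\b(?:" + alternation + r")\s*:|\Z)",
        re.IGNORECASE | re.DOTALL,
    )
    return {m.group(1): m.group(2).strip() for m in pattern.finditer(text)}


# Hypothetical label set for one document type (an assumption, not from any library).
FORM_A_LABELS = ["Name", "Address", "DOB", "Country"]

fields = extract_fields("Name: MyName Address: My Address", FORM_A_LABELS)
print(fields)  # {'Name': 'MyName', 'Address': 'My Address'}
```

With this shape, supporting another document type would mean supplying a different label list rather than writing a new parser. It breaks down if a value itself contains a known label followed by a colon, or if the OCR misreads a label.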