Text layout recognition with Python

https://www.devze.com 2023-03-19 06:33 Source: Internet
I'm trying to go through several thousand scanned files and sort them into folders based on type (i.e. if one of the files is a scanned copy of formA, it should go in the formA folder; if it's a scanned copy of formB, it should go in the formB folder, and so on). I feel like the best way to match files to types is by their text outlines, but I'm totally new to image processing, so if there's a better solution, I'm all ears.

I'm working in Python. Any ideas on the best way to do this? PIL? OpenCV? ImageMagick?

Thanks in advance...


This library is probably of interest to you:
http://code.google.com/p/ocropus/
It's made by Googlers and lets you do OCR and layout analysis from Python.
I had some trouble installing it, but that was quite a while back, so things may have been fixed by now.


I don't know what format your scanned documents are in, but pdfminer can do layout analysis for PDFs. I guess it would fit the bill for your purpose, provided the documents are in a reasonably decent PDF format (if you've just got "pure images", it won't do you any good).
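
To give a rough idea of what that could look like, here is a minimal sketch, assuming the pdfminer.six fork and PDFs that carry a text layer. It takes the text-box bounding boxes from the first page, marks which cells of a coarse grid they cover, and picks the most similar reference form. The grid size, the 0.5 threshold, the Jaccard comparison and all of the function names are my own illustrative choices, not something pdfminer provides.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points
Cell = Tuple[int, int]                   # (row, column) in the coarse grid


def first_page_text_boxes(pdf_path: str) -> Tuple[List[Box], float, float]:
    """Return the text-box bounding boxes and the page size of page 1."""
    # Imported lazily so the pure helpers below work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages(pdf_path, maxpages=1):
        boxes = [el.bbox for el in page if isinstance(el, LTTextContainer)]
        return boxes, page.width, page.height
    raise ValueError("no pages found in " + pdf_path)


def layout_signature(boxes: Iterable[Box], width: float, height: float,
                     grid: int = 16) -> Set[Cell]:
    """Mark every grid cell that at least one text box overlaps."""
    def to_index(value: float, extent: float) -> int:
        # Scale a coordinate to a grid index and clamp it to the grid.
        return max(0, min(grid - 1, int(value / extent * grid)))

    cells: Set[Cell] = set()
    for x0, y0, x1, y1 in boxes:
        for row in range(to_index(y0, height), to_index(y1, height) + 1):
            for col in range(to_index(x0, width), to_index(x1, width) + 1):
                cells.add((row, col))
    return cells


def similarity(a: Set[Cell], b: Set[Cell]) -> float:
    """Jaccard similarity between two sets of occupied cells."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(signature: Set[Cell], templates: Dict[str, Set[Cell]],
             threshold: float = 0.5) -> Optional[str]:
    """Return the best-matching template name, or None if nothing is close."""
    best_name, best_score = None, 0.0
    for name, template in templates.items():
        score = similarity(signature, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

You would build one signature per known form from a clean sample (for example `layout_signature(*first_page_text_boxes("formA.pdf"))`, where the file name is just a placeholder), then call `classify` on each incoming file and move it to the matching folder. Files that come back as `None` are probably best reviewed by hand.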
