I have been using Tesseract (version 3) on Linux to extract text from scanned PDF files. The problem is that the whole process is very slow. For example, extracting text from this 20-page document (http://www.a-pdf.com/scan-paper/a-pdf-scan-paper-doc.pdf) takes 514 seconds (over 8 minutes).
To convert the PDF to an image I use ImageMagick's convert tool. Below are the commands I run:
convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
tesseract tmp.tif out.txt
Note that 288 dpi is required, since otherwise Tesseract fails completely to extract text from the scanned file I tested.
Does anyone know how I can speed things up without affecting the quality of the result?
Try VietOCR to see whether it produces results faster. It can accept PDF input if Ghostscript is installed.
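Another thing that may be worth trying, if a single tesseract process only keeps one CPU core busy on your machine, is to write one TIFF per page and OCR the pages in parallel. The following is only a rough sketch, not a tested recipe for your document: the function name ocr_pdf_parallel, the page-NNN file names and the temporary directory are made up for illustration, and it assumes GNU xargs (for -P) and nproc are available. The conversion settings are the ones from the question, so the per-page input to tesseract should be the same as before.

```shell
#!/bin/sh
# Hypothetical helper: OCR a scanned PDF with one tesseract process per page.
# Usage: ocr_pdf_parallel src.pdf out.txt [jobs]
ocr_pdf_parallel() {
    src=$1
    out=$2
    jobs=${3:-$(nproc)}          # default: one job per CPU core (GNU nproc)
    work=$(mktemp -d) || return 1

    # Same conversion settings as in the question, but +adjoin and the
    # %03d pattern make ImageMagick write one TIFF file per page.
    convert -density 288 "$src" -colorspace Gray -depth 8 -alpha off \
        +adjoin "$work/page-%03d.tif" || return 1

    # Run up to $jobs tesseract processes at a time.
    # tesseract appends .txt to the output base, giving page-NNN.txt.
    find "$work" -name 'page-*.tif' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c 'tesseract "$1" "${1%.tif}"' sh {} \
        || return 1

    # Zero-padded names sort in page order, so the glob keeps pages in sequence.
    cat "$work"/page-*.txt > "$out"
    rm -rf "$work"
}
```

You would call it as `ocr_pdf_parallel src.pdf out.txt`, or pass a third argument to cap the number of parallel jobs. Whether this helps depends on how many cores you have and on how much of the 514 seconds is spent in convert rather than in tesseract, so it is worth timing both steps separately first.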