
How to detect whether a text file was produced by OCR

https://www.devze.com 2023-02-20 05:30 Source: Internet

I want to write an application in C# that checks whether a text file was produced by OCR or typed on a keyboard.


When I'm reading something, I can usually tell if it's OCRed by spotting spelling errors that come from substituting similar-looking characters for the correct ones. For example, O and o, S and s, 1 and l or I, rn and m, and so on. If you write your program to look for those anomalies, you can probably detect OCRed text.
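
As a rough illustration of that idea, here is a minimal sketch that flags tokens showing two of those symptoms: a 0 or 1 sitting inside a word, and a capital letter in the middle of a word. The class and method names (OcrHeuristics, SuspicionRatio, and so on) are made up for this example, and any threshold you apply to the ratio is something you would have to tune on your own documents.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of any library.
public static class OcrHeuristics
{
    // True when a token mixes letters with the digits 0 or 1,
    // e.g. "he1lo" or "w0rld", which often comes from l/1 or O/0 confusion.
    public static bool HasDigitInsideWord(string token)
    {
        bool hasLetter = token.Any(c => char.IsLetter(c));
        bool hasConfusableDigit = token.Any(c => c == '0' || c == '1');
        return hasLetter && hasConfusableDigit;
    }

    // True when an uppercase letter directly follows a lowercase one, e.g. "cOde".
    public static bool HasMidWordCapital(string token)
    {
        for (int i = 1; i < token.Length; i++)
        {
            if (char.IsLower(token[i - 1]) && char.IsUpper(token[i]))
                return true;
        }
        return false;
    }

    // Fraction of whitespace-separated tokens that trip either check.
    public static double SuspicionRatio(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] tokens = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0) return 0.0;
        int suspicious = tokens.Count(t => HasDigitInsideWord(t) || HasMidWordCapital(t));
        return (double)suspicious / tokens.Length;
    }
}
```

Expect false positives: legitimate tokens such as "iPhone" or "v1" trip these checks too, so treat the ratio as a hint and not as proof.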

Similarly, you can look for other spelling errors that typically indicate typed text. For example, transposed letters (teh) or a letter replaced by one of its neighbors on the keyboard are likely indicators that the text was typed in.
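
The transposition check could be sketched like this, assuming you already have a set of correctly spelled words to compare against. TypingHeuristics and IsAdjacentTransposition are hypothetical names, and the word list is something you would supply yourself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; the word list is an assumption.
public static class TypingHeuristics
{
    // Returns true when swapping one pair of adjacent characters in the
    // token yields a word from knownWords, e.g. "teh" -> "the".
    public static bool IsAdjacentTransposition(string token, ISet<string> knownWords)
    {
        if (knownWords.Contains(token)) return false; // already a known word

        char[] chars = token.ToCharArray();
        for (int i = 0; i + 1 < chars.Length; i++)
        {
            if (chars[i] == chars[i + 1]) continue; // swapping equal chars changes nothing

            (chars[i], chars[i + 1]) = (chars[i + 1], chars[i]);
            bool match = knownWords.Contains(new string(chars));
            (chars[i], chars[i + 1]) = (chars[i + 1], chars[i]); // undo the swap

            if (match) return true;
        }
        return false;
    }
}
```

For example, with a word set containing "the", the token "teh" is reported as a transposition while "the" and "xyz" are not. The comparison is case-sensitive, so you may want to lower-case tokens first.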


This problem is tough to solve in general, but easy to solve for specific cases.

For example, if your OCR software inserts a bunch of non-ASCII characters, and all your documents contain only the upper-case letters A-Z, the lower-case letters a-z, digits, and punctuation, then your job is fairly simple.

To solve that problem, you could loop over the characters in the document with a for loop and use if statements like if(char.IsLetter(currentChar)) and if(char.IsDigit(currentChar)), or use char.GetUnicodeCategory in a switch statement.
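
Here is a minimal sketch of the switch-statement variant. CharacterScan and CountUnexpectedChars are names invented for this example. It also assumes that ASCII symbols such as $ and + count as expected, which goes slightly beyond the letters, digits, and punctuation described above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of any library.
public static class CharacterScan
{
    // Counts characters outside the expected set: ASCII letters, digits,
    // punctuation, symbols, and whitespace.
    public static int CountUnexpectedChars(string text)
    {
        int unexpected = 0;
        foreach (char c in text)
        {
            if (c > 127) { unexpected++; continue; } // anything non-ASCII

            switch (char.GetUnicodeCategory(c))
            {
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                case UnicodeCategory.OpenPunctuation:
                case UnicodeCategory.ClosePunctuation:
                case UnicodeCategory.OtherPunctuation:
                case UnicodeCategory.MathSymbol:      // assumption: symbols are fine
                case UnicodeCategory.CurrencySymbol:
                case UnicodeCategory.ModifierSymbol:
                    break; // expected character
                default:
                    // Tabs and newlines are control characters but still expected.
                    if (!char.IsWhiteSpace(c)) unexpected++;
                    break;
            }
        }
        return unexpected;
    }
}
```

A document that returns a non-zero count is a candidate for "OCRed" under the assumption above, although a typed document containing a word like résumé would be counted as well.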

If there are specific words or letters the OCR always gets wrong, you could make a Dictionary<string, bool> object and populate it with words you know the OCR always gets wrong, and/or misspellings that you know only a human typist would produce. Then, loop over all the words in your document and see if you get a match in your dictionary, which suggests that the text came from a human or from OCR.
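
One way to sketch that lookup, assuming the bool value means "true = OCR marker, false = human marker" (that convention, the WordEvidence class, and the sample entries are all made up for this example):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; the marker entries are assumptions.
public static class WordEvidence
{
    // Counts how many words in the text are known OCR markers (value true)
    // and how many are known human-typing markers (value false).
    public static (int OcrHits, int HumanHits) Tally(string text, IDictionary<string, bool> markers)
    {
        int ocrHits = 0, humanHits = 0;
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (markers.TryGetValue(word, out bool isOcrMarker))
            {
                if (isOcrMarker) ocrHits++;
                else humanHits++;
            }
        }
        return (ocrHits, humanHits);
    }
}
```

Usage might look like this:

```csharp
var markers = new Dictionary<string, bool>
{
    { "rnodern", true },  // "modern" with m read as rn
    { "teh", false }      // typical typing transposition
};
var result = WordEvidence.Tally("A rnodern house, teh end.", markers);
// result.OcrHits and result.HumanHits can then be compared to pick a verdict.
```

Whichever count is higher gives you a guess, and a tie or two zeros means the dictionary had nothing to say about that document.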

If you're using OCR software that doesn't tend to screw things up in an easily detectable way, you'd have to resort to artificial intelligence to solve it. Hopefully you don't have to resort to this, because this is really hard stuff to program, and takes a lot of work to set up correctly and maintain. From your description and your comments, it sounds like you can use the easier solution.

No matter what, software that does this kind of job is going to get some of the documents wrong. The user may have typed in something strange, or copy/pasted in some non-ASCII character (such as the é in résumé), or the OCR may not have made any detectable mistakes. Hopefully you have a way to deal with this fact, or your situation isn't risky enough for this to be a problem.


OCRed text almost always consists of one-line paragraphs. And OCR engines usually have trouble distinguishing some upper/lower case letters and letters with similar looking glyphs, such as S/s, V/v, X/x, O/o/0, 1/l/I, etc.
