I'm filtering chat messages on a chat system where constraining strings to Latin-1 English is desirable. Users tend to use creative typing, e.g.
ßòógīě§
instead of
Boogies
In Java, there are Unicode normalization methods that can remove diacritical marks, but I'm more interested in ways of normalizing the shapes of the letters towards English and the Latin-1 character set.
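For reference, the diacritic-stripping approach mentioned above might look roughly like this. It is a minimal sketch built on the standard `java.text.Normalizer`; the class and method names (`DiacriticStripper`, `stripDiacritics`) are made up for illustration. It also shows why normalization alone is not enough here: characters with no canonical decomposition are left untouched.

```java
import java.text.Normalizer;

public class DiacriticStripper {

    // Decompose to NFD so accented letters become base letter + combining
    // mark(s), then delete every combining mark (Unicode category M).
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The example string from the question, written with escapes so the
        // source file encoding does not matter.
        String creative = "\u00DF\u00F2\u00F3g\u012B\u011B\u00A7";
        // The accents on o, i and e are removed, but U+00DF (sharp s) and
        // U+00A7 (section sign) have no decomposition and pass through as-is.
        System.out.println(stripDiacritics(creative));
    }
}
```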
Are there any tables, libraries, or methods that can map common Unicode characters outside Latin-1 to their visually nearest forms? E.g.
ß -> B
§ -> S
¥ -> Y
¤ -> o
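To make the idea concrete, a hand-built table could be sketched as below. This is only an illustration under stated assumptions: the class name `LookalikeFolder`, the method `fold`, and the table itself are hypothetical, and the table holds just the four pairs listed above, not a real or complete look-alike mapping. It assumes Java 9 or later for `Map.of`.

```java
import java.text.Normalizer;
import java.util.Map;

public class LookalikeFolder {

    // Hypothetical, hand-picked table: only the four pairs from the question.
    private static final Map<Character, Character> LOOKALIKES = Map.of(
        '\u00DF', 'B',   // sharp s looks like B
        '\u00A7', 'S',   // section sign looks like S
        '\u00A5', 'Y',   // yen sign looks like Y
        '\u00A4', 'o'    // currency sign looks like o
    );

    public static String fold(String input) {
        // Step 1: strip diacritics via NFD decomposition.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}+", "");
        // Step 2: replace any character that has a table entry; everything
        // else (including surrogate pairs) is copied through unchanged.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            char mapped = LOOKALIKES.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

With this table, `LookalikeFolder.fold` turns the example string from the question into `BoogieS`. Anything not in the table passes through untouched, so a real filter would still need a policy for unmapped characters.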
I suspect that the answer is "No, this would be too big, just filter them all out instead" but I can hope...
I think your best bet is to use an OCR (optical character recognition) engine. After all, that's precisely what you're after: a best effort to parse the letters into readable A-Z characters. (Remember to render the chat messages onto an image using the same font as your chat client.)
Two OCR libraries usable from Java:
- Asprise
- Tesseract (a C++ engine; Java wrappers such as Tess4J exist)
The correct solution is not to install idiotic "profanity filters" (which I assume are behind this request). If the community cannot police itself at all in that regard, moderate it manually and ban offenders, or shut it down. Having to wrestle with the Scunthorpe problem will offend your users much more than some swearing kids.