How do I filter chat messages by normalizing letter forms?

Developer (https://www.devze.com), 2023-01-19 17:25. Source: web

I'm filtering chat messages on a chat system where constraining strings to Latin-1 English is desirable. Users tend to use creative typing, e.g.

ßòógīě§

instead of

Boogies

In Java, there are Unicode normalization methods that can remove diacritical marks, but I'm more interested in methods of normalizing the shapes of the letters towards English and the Latin-1 character set.
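For reference, the diacritic-stripping approach mentioned above can be sketched with the standard `java.text.Normalizer` class. The class and method names (`DiacriticStripper`, `stripDiacritics`) are made up for illustration. The sample string is the one from this question, written with Unicode escapes so the source file stays ASCII-only.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    // Decompose each character into base letter + combining marks (NFD),
    // then delete the combining marks (Unicode general category M).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The sample string from the question, as Unicode escapes.
        String sample = "\u00DF\u00F2\u00F3g\u012B\u011B\u00A7";
        System.out.println(stripDiacritics(sample));
    }
}
```

This turns ßòógīě§ into ßoogie§: the accented vowels are reduced to plain letters, but ß and § have no decomposition, which is why normalization alone does not solve the problem.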

Are there any tables, libraries, or methods that map common Unicode characters outside Latin-1 to their visually nearest forms? E.g.

ß -> B
§ -> S
¥ -> Y
¤ -> o

I suspect that the answer is "No, this would be too big, just filter them all out instead" but I can hope...
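To make the idea concrete, here is a minimal sketch of a hand-maintained lookup table combined with diacritic stripping. The class name `LookalikeFolder`, the method `fold`, and the table contents are hypothetical; the four entries are just the examples listed above. Note that ß, §, ¥ and ¤ are themselves part of Latin-1, so this sketch assumes the real target is printable ASCII and drops whatever it cannot map.

```java
import java.text.Normalizer;
import java.util.Map;

final class LookalikeFolder {
    // Hypothetical hand-picked table: only the examples from the question.
    // A real table would need many more entries.
    private static final Map<Character, String> LOOKALIKES = Map.of(
        '\u00DF', "B",  // sharp s
        '\u00A7', "S",  // section sign
        '\u00A5', "Y",  // yen sign
        '\u00A4', "o"   // currency sign
    );

    static String fold(String input) {
        // Step 1: substitute characters that have a table entry.
        StringBuilder mapped = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = LOOKALIKES.get(c);
            mapped.append(replacement != null ? replacement : String.valueOf(c));
        }
        // Step 2: decompose (NFD) and remove combining marks.
        String stripped = Normalizer.normalize(mapped, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}+", "");
        // Step 3 (assumption): drop anything still outside printable ASCII,
        // which also removes tabs and newlines.
        return stripped.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // The sample string from the question, as Unicode escapes.
        System.out.println(fold("\u00DF\u00F2\u00F3g\u012B\u011B\u00A7")); // prints "Boogies"
    }
}
```

The table lookup runs before normalization so that visual substitutions take priority over Unicode's own decompositions. Whether unmapped characters should be dropped, replaced with a placeholder, or cause the message to be rejected is a policy choice this sketch does not settle.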

I think your best bet is to use an OCR (optical character recognition) engine. After all, that's precisely what you're after: a best effort to parse the letters into readable A-Z characters. (Remember to render the chat messages onto an image using the same font as your chat client.)

Two OCR libraries usable from Java:

Asprise
Tesseract (a native C++ engine; Java wrappers such as Tess4J exist)

The correct solution is not to install idiotic "profanity filters" (which I assume are behind this request). If the community cannot police itself at all in that regard, moderate it manually and ban offenders, or shut it down. Having to wrestle with the Scunthorpe problem will offend your users much more than some swearing kids.