Badword filter in PHP?

I am writing a badword filter in PHP.

I have a list of badwords in an array and the method cleanse_text() is written like this:

public static function cleanse_text($originalstring){
   if (!self::$is_sorted) self::doSort();
   return str_ireplace(self::$badwords, '****', $originalstring);
}

This works trivially for exact matches, but I also want to censor words that have been disguised, like 'ab*d' where 'abcd' is a bad word. This is proving to be a bit more difficult.

Here are my questions:

Is a badword filter worth bothering with? (It is a site for professionals, so I would have thought a certain minimum decorum is required.)
Is it worth the hassle of trying to capture obvious workarounds like 'f*ck', or should I not attempt to filter those out?
Is there a better way of writing the cleanse_text() method above?
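For reference, the kind of disguised-word matching described above could be sketched with a generated regular expression. This is only a rough sketch under a few assumptions: disguised_pattern() and cleanse_disguised() are hypothetical names, the set of masking symbols (*, @, #, $, !) is an arbitrary choice, the first letter is assumed to stay intact, and the words are assumed to be plain ASCII.

```php
<?php
// Hypothetical helper: build a regex that matches a bad word even when some
// letters after the first are replaced by one masking symbol, e.g. "ab*d".
function disguised_pattern(string $word): string
{
    $parts = [];
    foreach (str_split($word) as $i => $char) {
        $quoted = preg_quote($char, '/');
        // The first letter must match literally; every later position
        // accepts either the real letter or one symbol from the mask set.
        $parts[] = ($i === 0) ? $quoted : '(?:' . $quoted . '|[*@#$!])';
    }
    // Lookarounds instead of \b, because \b does not behave as a word
    // boundary next to symbols such as "*".
    return '/(?<![A-Za-z])' . implode('', $parts) . '(?![A-Za-z])/i';
}

// Hypothetical counterpart to cleanse_text() for disguised spellings.
function cleanse_disguised(string $text, array $badwords): string
{
    $patterns = array_map('disguised_pattern', $badwords);
    return preg_replace($patterns, '****', $text);
}

echo cleanse_disguised('say ab*d now', ['abcd']), "\n"; // say **** now
echo cleanse_disguised('abcdef is fine', ['abcd']), "\n"; // abcdef is fine
```

Spaced-out letters, doubled letters and look-alike digits would still slip through, so this only covers the simplest workaround.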

I definitely wouldn't bother with it.

It's a site for professionals, so you can assume that they will act appropriately. Some moderation and enforcement of rules will put anyone in line. Look at Stack Overflow for example. Even without the community moderation tools, people can be pressured into behaving appropriately.
It would fail. There would be too many false positives ("clbuttic"), and a list containing every possible swear word would be impossible to maintain. Replacing certain letters (e.g. f*ck) makes a word no less offensive. Removing the word altogether destroys meaning, which is a huge problem with false positives.
Consider a discussion about donkeys and birds. It'd be all about asses, tits, boobies and cocks.

If it's a website for professionals, then don't bother. You won't see much cursing in the first place, and when you do it will most likely be for comedic effect or similar. The people who do swear a lot in an immature manner will be punished simply by making a bad impression on everyone. (And those who completely overdo it should be dealt with by moderators anyway, so that shouldn't be an issue.)

What happens when you try to implement a bad word filter is you end up censoring completely benign uses of swear words, and in many cases, you also censor words that are not swear words but are similar enough for the filter to catch. (It's called the Scunthorpe problem, as @deceze mentioned in the comments.) Also, unless you go all-out, it will be really easy to circumvent. All-in-all, I'd say it's not worth the effort.
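To make the Scunthorpe problem concrete, here is a small sketch comparing plain substring replacement with whole-word matching. The function name cleanse_whole_words() and the sample word are made up for illustration. Whole-word matching avoids the embedded-substring false positives, but it still censors benign uses of the word itself.

```php
<?php
// Hypothetical helper: replace listed words only when they appear as whole
// words, so they are not matched inside longer, harmless words.
function cleanse_whole_words(string $text, array $badwords): string
{
    if ($badwords === []) {
        return $text; // nothing to filter
    }
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $badwords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    return preg_replace($pattern, '****', $text);
}

// Substring replacement mangles innocent words:
echo str_ireplace('ass', '****', 'a classic assumption'), "\n";
// a cl****ic ****umption

// Whole-word matching leaves them alone:
echo cleanse_whole_words('a classic assumption', ['ass']), "\n";
// a classic assumption
```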

Take Stack Overflow as an example. It has no bad word filter, and it's doing just fine--I haven't heard of any problems with that kind of thing.

Okay, here's a different idea:

I don't know what content you are filtering, but I'll just assume it's a comment system since this will still apply for whatever else it might be.

You probably have some kind of administrative interface. What if every time someone includes a possible "bad word" in a comment it leaves a note for you in said interface. Or sends you a daily email of all the maybe-profanity that appeared on your site. There could be links next to each listing that when clicked, would automatically apply a filter to that comment/post/whatever, or delete it, or whatever you want. Then you could just glance at the report, click once or twice to clean up the site, and be done with it.
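A rough sketch of the reporting side, assuming comments are available as an array keyed by ID. The names find_flagged_words() and build_report() and the watch list are hypothetical. Nothing is censored here; the code only collects what a moderator should look at.

```php
<?php
// Hypothetical helper: return the watch-list words that occur in a comment.
function find_flagged_words(string $comment, array $watchlist): array
{
    $hits = [];
    foreach ($watchlist as $word) {
        // Whole-word, case-insensitive match.
        if (preg_match('/\b' . preg_quote($word, '/') . '\b/i', $comment) === 1) {
            $hits[] = $word;
        }
    }
    return $hits;
}

// Hypothetical helper: map comment ID => flagged words, skipping clean comments.
// The result could feed an admin page or a daily email.
function build_report(array $comments, array $watchlist): array
{
    $report = [];
    foreach ($comments as $id => $comment) {
        $hits = find_flagged_words($comment, $watchlist);
        if ($hits !== []) {
            $report[$id] = $hits;
        }
    }
    return $report;
}

$report = build_report(
    [1 => 'Nice post', 2 => 'What a damn mess'],
    ['damn', 'hell']
);
print_r($report); // only comment 2 is listed, flagged for "damn"
```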

You might think this wouldn't scale. It probably wouldn't. But if your site doesn't get a tonne of traffic, you might not even get a report every day. Or every week. You might not have to intervene much at all. No lists, no thinking of every possible objectionable word and all of their possible spellings, no false-positives.

It could work.

It's all 8u11$#1+ anyway. Just post a human-readable rule, let humans flag offensive contributions and ban offenders.

There are so many cases where you would want to implement this functionality. Will it ever be 100% correct/secure/failsafe? Of course not, but do tell me what is!

If you combine the OP's request (which auto-flags the post/user input) with a 'report this' function for general users, you have a really strong system. Most large corporations and business entities use this "double up" system, which combines admin staff reviews of auto-flagged input with the ability for normal users to report anything that has slipped under the radar.

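function spam_found($full_string){
    $spam = array('100%', '100 %', '110%', '110 %', 'free');
    $i = 0; // number of spam words found
    foreach ($spam as $spamword) {
        // strpos() returns 0 for a match at the very start of the string,
        // so compare against false explicitly instead of testing truthiness.
        if (strpos($full_string, $spamword) !== false) {
            $i++;
        }
    }
    return $i > 0;
}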

This is just a very basic function "template" that we use at my company. It is used in an MVC environment, where it is a helper function called from a controller, and we have many such functions depending on the kind of input. It can easily be adapted to a range of situations, for example to return true only if 3 of your spam words are found.
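The "3 spam words" adaptation could look something like the sketch below. The names spam_score() and spam_found_with_threshold() are made up for illustration, the word list is passed in rather than hard-coded, and matching is case-insensitive via stripos(), which is an assumption rather than what the original function does.

```php
<?php
// Hypothetical helper: count how many distinct spam words occur in the input.
function spam_score(string $full_string, array $spam_words): int
{
    $score = 0;
    foreach ($spam_words as $spam_word) {
        // stripos() is case-insensitive; compare with false so that a match
        // at position 0 still counts.
        if (stripos($full_string, $spam_word) !== false) {
            $score++;
        }
    }
    return $score;
}

// Hypothetical helper: flag the input only once enough spam words are present.
function spam_found_with_threshold(string $full_string, array $spam_words, int $threshold = 3): bool
{
    return spam_score($full_string, $spam_words) >= $threshold;
}

$words = ['100%', 'free', 'guaranteed', 'winner'];
var_dump(spam_found_with_threshold('FREE prize, 100% guaranteed', $words)); // bool(true)
var_dump(spam_found_with_threshold('free lunch', $words));                  // bool(false)
```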

For example, in Australia, if you are advertising a job, it is illegal to specify that it is only for a specific gender. That's right, if you are looking for a girl to work in a strip club, you are not allowed to say "only looking for girls". If a website is found to have such a position advertised (e.g. on Facebook), the website (not the advertiser/person who posted the ad) is liable for any possible criminal or civil charges/lawsuits.

In the above-mentioned case, it would make perfect sense for Facebook to do a "spam check" on any jobs advertised in Australia that mention words such as "male", "female", "guys", "girls", etc.