I am creating a search engine (for study purposes) and I want to know how Google recognizes adult content and images with SafeSearch (http://en.wikipedia.org/wiki/Safesearch).
The programming language doesn't matter; I only want to know the general approach, applicable in any language.
If the rules for any sort of content filter fell into the hands of people trying to get that content through the filter, the filter would become ineffective.
So I imagine that Google's rules (1) are not publicly available and (2) change frequently.
That said, starting with a small blacklist of adult sites and following outgoing links (and/or finding sites that link to the blacklisted sites) probably finds a huge number of adult sites, though by no means all of them. You'd want some sort of text processing and image recognition algorithms in addition.
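As a rough illustration of the blacklist idea, here is a minimal Python sketch that walks outgoing links breadth-first from a seed list. The function name `expand_blacklist`, the `max_depth` cutoff and the toy link graph are all assumptions made for the example; a real crawler would build the link graph itself, and nothing here reflects how Google actually does it.

```python
from collections import deque

def expand_blacklist(seed, outgoing_links, max_depth=2):
    """Breadth-first walk over outgoing links, starting from a seed blacklist.

    seed: iterable of known adult domains.
    outgoing_links: dict mapping a domain to the domains it links to
        (a hypothetical, already-crawled link graph).
    Returns a dict of candidate domain -> hop distance from the seed.
    """
    distance = {site: 0 for site in seed}
    queue = deque(distance)
    while queue:
        site = queue.popleft()
        if distance[site] >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in outgoing_links.get(site, ()):
            if target not in distance:
                distance[target] = distance[site] + 1
                queue.append(target)
    return distance

# Toy link graph with made-up domain names.
links = {
    "adult-a.example": ["adult-b.example", "payments.example"],
    "adult-b.example": ["adult-c.example"],
    "adult-c.example": ["adult-d.example"],
}
candidates = expand_blacklist(["adult-a.example"], links, max_depth=2)
```

Note that `payments.example` ends up in the candidate set even though it is presumably not an adult site, which is why the result should be treated as a list of candidates to check with other methods rather than as a finished blacklist.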
NOTE: A popular theory is that adult content providers pay people to ask questions on stackoverflow.com so that Jon Skeet and Marc Gravell will have less time to update the SafeSearch filters. However, it is easily shown that Jon and Marc answer questions at such a high rate that any such strategy would not be economically viable.
Ben's answer is correct on all points, but I would like to add some considerations of my own.
About image recognition: given a large set of training images, you will find it fairly easy to identify objects such as naked breasts, penises and the like inside them using pattern recognition.
All artificial intelligence algorithms, however, have weak points. You should expect a certain percentage of your images to be misclassified, depending on the quality of the classifier used.
So you have to apply other criteria beyond image processing. Google's criteria are surely not public, but you may want to consider ICRA tags, which voluntarily mark certain material as adult material, as well as text processing and cross-domain links. If I were the creator of SafeSearch, I would have relied on the following pattern: adult sites often exchange links, so you'll find lots of intersections in the link graphs between a group of adult sites.
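One way to turn the link-exchange observation into a number is to measure how many known adult sites link back and forth with a candidate site. The Python sketch below does this; the function name `reciprocal_link_ratio`, the domain names and the link graph are hypothetical and only serve the example.

```python
def reciprocal_link_ratio(site, known_adult, outgoing_links):
    """Fraction of known adult sites that exchange links with `site`.

    A link exchange means `site` links to the other site AND the other
    site links back. `outgoing_links` maps a domain to the domains it
    links to (an assumed, already-crawled link graph).
    """
    if not known_adult:
        return 0.0
    out = set(outgoing_links.get(site, ()))
    mutual = [
        other for other in known_adult
        if other in out and site in outgoing_links.get(other, ())
    ]
    return len(mutual) / len(known_adult)

# Made-up link graph for illustration.
known_adult = {"a.example", "b.example"}
links = {
    "candidate.example": ["a.example", "b.example", "news.example"],
    "a.example": ["candidate.example"],
    "b.example": ["a.example"],
    "blog.example": ["a.example"],
}
candidate_ratio = reciprocal_link_ratio("candidate.example", known_adult, links)
blog_ratio = reciprocal_link_ratio("blog.example", known_adult, links)
```

Here `candidate.example` exchanges links with one of the two known sites, while `blog.example` only links out and gets nothing back, so it scores zero. Requiring links in both directions is a design choice that avoids flagging sites that merely mention an adult site.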
Putting it all together, a good classification approach uses several smaller criteria, scoring them to determine whether an image is an adult image or not.
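The scoring idea could look something like the Python sketch below, which takes a weighted average over whichever criteria produced a result. The signal names, the weights and the 0.5 threshold are invented for illustration; in practice they would have to be tuned against labeled data.

```python
def adult_score(signals, weights):
    """Weighted average of per-criterion scores, each assumed to be in [0, 1].

    Only criteria present in `signals` contribute, so a missing signal
    (for example, no ICRA tag on the page) does not drag the score down.
    """
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * value for name, value in signals.items())
    return weighted / total_weight

# Hypothetical weights; real values would come from tuning.
WEIGHTS = {"image": 0.4, "text": 0.3, "links": 0.2, "icra_tag": 0.1}

# Scores from three separate classifiers; no ICRA tag was found.
signals = {"image": 0.9, "text": 0.6, "links": 0.8}
score = adult_score(signals, WEIGHTS)
is_adult = score >= 0.5  # assumed decision threshold
```

With these made-up numbers the combined score is about 0.78, so the image is flagged even though the text signal alone is fairly weak.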
Possibly in a similar way to how spam is filtered.
First step is to create a training set, based on known adult sites, and extract features from them. These could be keywords, colors used in images, domain name structure, whois details, whatever. Anything that could in some way be specifically different for adult content as compared to non-adult content.
Next step is to apply some sort of statistical model to that. Bayesian models seem to work well for spam, but may not for adult stuff.
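To make the spam-filter analogy concrete, here is a small naive Bayes text classifier in plain Python, trained on keyword features. The class name, the two labels and the tiny training set are assumptions for the example; a real training set would be built from many known adult and non-adult pages, and whether this model works well for adult content is exactly the open question raised above.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial naive Bayes over word lists, with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()

    def train(self, label, words):
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, label, words):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab_size = len(self.vocabulary)
        # Start from the log prior: share of training documents with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((counts[word] + 1) / (total + vocab_size))
        return score

    def classify(self, words):
        return max(self.doc_counts, key=lambda label: self.log_score(label, words))

# Tiny made-up training set.
model = NaiveBayes()
model.train("adult", ["xxx", "nude", "webcam"])
model.train("adult", ["nude", "pics", "xxx"])
model.train("safe", ["python", "tutorial", "search"])
model.train("safe", ["search", "engine", "pics"])

label_a = model.classify(["nude", "pics"])
label_b = model.classify(["search", "tutorial"])
```

Working in log space avoids numeric underflow when pages contain many words. The same `train`/`classify` shape would also accept the other features mentioned above (colors, domain name structure, whois details) once they are encoded as tokens.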
Support vector machines seem like a good fit, but they are a lot more complex and I'm not really familiar with them myself.