I have a website which crawls images. The images are served according to each user's preference of whether unsafe (18+) images are allowed or not.
Right now we sort out the images ourselves, and it takes a very long time since we get a lot of image submissions per day.
I know Google does it pretty well.
I just want the images of sexual and pornographic nature to be sorted out. Girls in bikini are fine.
I had an idea in mind where the program would search an image for the patterns that I don't want to be shown. For example, it could search images for private parts and, if the pattern is found, mark the image as unsafe.
I was wondering whether there is any program or algorithm in PHP that can be used to do this for us?
Even though SimpleCoder's solution is far more sophisticated than this, I would still recommend manually moderating the images. Unless you spend thousands of dollars making some extremely advanced algorithm, you will always have false positives and negatives. Just as a little experiment, I went to http://pikture.logikit.net/Demo/index and uploaded 8 images. 6 were clean and 2 were explicit. Of the two explicit images, one was falsely marked as clean. Of the six clean images, four were falsely marked as explicit. Mind you, I purposely tried to fool it by choosing images that I thought a computer would get confused by, and it turns out it was pretty easy. Their program got only 3 of the 8 images right, a measly 37.5%.
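To make the arithmetic behind that 37.5% explicit, here is a small Python sketch that tallies the eight uploads from the experiment. The function name `summarize` and the dictionary keys are only illustrative choices.

```python
def summarize(explicit_caught, explicit_missed, clean_passed, clean_flagged):
    """Return accuracy and error rates for a binary explicit/clean filter."""
    total = explicit_caught + explicit_missed + clean_passed + clean_flagged
    return {
        # Share of all images that were classified correctly.
        "accuracy": (explicit_caught + clean_passed) / total,
        # Share of explicit images that slipped through as clean.
        "false_negative_rate": explicit_missed / (explicit_caught + explicit_missed),
        # Share of clean images that were wrongly marked explicit.
        "false_positive_rate": clean_flagged / (clean_passed + clean_flagged),
    }

# Counts from the experiment: 2 explicit images (1 missed), 6 clean images (4 flagged).
result = summarize(explicit_caught=1, explicit_missed=1, clean_passed=2, clean_flagged=4)
print(result["accuracy"])  # 0.375
```

On this tiny sample, half of the explicit images slipped through and two thirds of the clean ones were blocked. Eight images is far too few to judge the tool in general, but it shows why the two error rates are worth tracking separately.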
Here are a few recommendations that should at least make life somewhat easier for the moderators and shouldn't be too difficult to implement programmatically:
1) Take all currently rejected images (if possible) and hash the files and store the hashes in a database. Hash all new submissions when they come in and verify the hash against the already existing hashes. If a match is found, automatically flag it. When an admin manually rejects the image, add this hash to the database as well. This will at least prevent you from having to re-flag duplicates.
2) Add weight to the $isPornScore of all images from a domain if any explicit content is found in any file on that domain. Perhaps more weight should be given for multiple occurrences from one domain. Do the same for domains that hotlink to images on these domains.
3) Check the domain name itself. If it contains explicit language, add to the $isPornScore. The same should be done for the URI of both the image and the page containing the anchor tag (if different).
4) Check the text around the image. Even though this isn't 100% accurate, if you have a blatant "Farm sexxx with three women and ..." somewhere on a page, you can at least increase the weight given to the chance that all images on that page (or domain) are explicit.
5) Use any other techniques or criteria you can and apply an overall "score" to the image. Then use your own judgment and/or trial and error to pick a threshold, and if the score is higher than that, automatically flag the image as explicit. Try to reach an acceptable balance between false positives and whatever the cost of having an explicit image go unflagged is. If an image is not automatically flagged as explicit, still require moderator intervention.
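Recommendation 1 could look roughly like the following. This is a minimal Python sketch, assuming the image bytes are already in memory and using an in-memory set where a real site would use a database table. The names `file_hash` and `RejectedHashes` are made up for illustration. In PHP, `hash_file('sha256', $path)` gives the same kind of digest.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


class RejectedHashes:
    """Remembers hashes of rejected images (stand-in for a database table)."""

    def __init__(self):
        self._hashes = set()

    def reject(self, data: bytes) -> None:
        # Called when a moderator rejects an image.
        self._hashes.add(file_hash(data))

    def is_known_reject(self, data: bytes) -> bool:
        # Called for each new submission before it reaches a moderator.
        return file_hash(data) in self._hashes


store = RejectedHashes()
store.reject(b"bytes of a rejected image")
print(store.is_known_reject(b"bytes of a rejected image"))  # True
print(store.is_known_reject(b"bytes of another image"))     # False
```

Keep in mind that a cryptographic hash only catches byte-identical files. A resized or re-encoded copy of the same picture produces a different hash and still needs a moderator.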
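Recommendations 2 to 5 boil down to adding up weighted signals and comparing the total with a threshold. Here is a rough Python sketch of that idea. The term list, the weights, the threshold, and the function names are all placeholder assumptions that would need tuning by trial and error, and plain substring matching will produce false hits of its own.

```python
from urllib.parse import urlparse

# Hypothetical values, for illustration only; tune them on real data.
EXPLICIT_TERMS = ("porn", "xxx")
FLAGGED_DOMAIN_WEIGHT = 3.0
URL_TERM_WEIGHT = 2.0
TEXT_TERM_WEIGHT = 1.0
AUTO_FLAG_THRESHOLD = 4.0


def count_terms(text):
    """Count how many of the explicit terms occur in the text."""
    lowered = text.lower()
    return sum(1 for term in EXPLICIT_TERMS if term in lowered)


def is_porn_score(image_url, page_url, surrounding_text, flagged_domains):
    """Add up the weighted signals for one image."""
    score = 0.0
    # Domain already known to host explicit content.
    image_host = urlparse(image_url).hostname or ""
    if image_host in flagged_domains:
        score += FLAGGED_DOMAIN_WEIGHT
    # Explicit language in the image URI and, if different, the page URI.
    score += URL_TERM_WEIGHT * count_terms(image_url)
    if page_url != image_url:
        score += URL_TERM_WEIGHT * count_terms(page_url)
    # Explicit language in the text around the image.
    score += TEXT_TERM_WEIGHT * count_terms(surrounding_text)
    return score


def triage(score):
    """Auto-flag high scores; everything else still goes to a moderator."""
    return "auto_flag" if score >= AUTO_FLAG_THRESHOLD else "needs_moderator"


score = is_porn_score(
    "http://xxx-pics.example/a.jpg",
    "http://blog.example/post",
    "Farm sexxx with three women",
    {"xxx-pics.example"},
)
print(score, triage(score))  # 6.0 auto_flag
```

Note that a low score never approves an image by itself. It only decides whether the image is flagged automatically or waits for a moderator.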
I'm assuming you want to filter based on image content, and not context (e.g. what words are around the image).
That's some pretty intense AI. You will need to train an algorithm so it can 'learn' what an unsafe image looks like. Here is a great paper on the subject: http://www.stanford.edu/class/cs229/proj2005/HabisKrsmanovic-ExplicitImageFilter.pdf
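To give a feel for what "training" means here, below is a toy Python sketch of a nearest-centroid classifier. It is not the method from the paper, only about the simplest learner I could pick for illustration. It assumes each image has already been reduced to a short list of numbers, and the two features used here (fraction of skin-toned pixels, relative size of the largest skin-toned region) and their values are invented.

```python
def centroid(vectors):
    """Average a list of equal-length feature vectors, column by column."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(examples):
    """examples: list of (feature_vector, label) pairs. Returns one centroid per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: squared_distance(model[label], features))


# Made-up training data: [skin pixel fraction, largest skin region size].
training = [
    ([0.10, 0.05], "safe"),
    ([0.20, 0.10], "safe"),
    ([0.70, 0.60], "unsafe"),
    ([0.80, 0.50], "unsafe"),
]
model = train(training)
print(predict(model, [0.75, 0.55]))  # unsafe
print(predict(model, [0.10, 0.10]))  # safe
```

The hard part is everything this sketch skips: extracting features that actually separate a bikini photo from an explicit one, and collecting enough labeled images to train on.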