How to dynamically filter website content using PHP

开发者 (https://www.devze.com), 2023-01-13 20:00. Source: the web
I'm currently looking for a way to dynamically filter website content. By "dynamic" I mean calculating the percentage of bad words (e.g. shit, f**k, etc.) among all the words on the first page. Say the website is allowed if that percentage is no more than 30%. How do I check each word on the first page against the bad-words list, then divide the number of matches by the total number of words to get the percentage? The rationale is not to build a content filter but simply to block the website should even a single word on the page match the bad-words list. I have got the following, but it is static.

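$filename = "filters.txt";
$fp = @fopen($filename, 'r');

if ($fp) {
    // Each line is split on "~" into a pattern and its replacement.
    $array = explode("\n", fread($fp, filesize($filename)));
    fclose($fp);

    foreach ($array as $val) {
        if (trim($val) === '') {
            continue; // skip blank lines, e.g. after a trailing newline
        }
        // split() was removed in PHP 7; explode() works for a literal delimiter.
        list($before, $after) = explode("~", $val, 2);
        $input = preg_replace($before, $after, $input);
    }
}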

* filters.txt contains the list of bad words.
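
For the percentage approach, the bad words need to be available as a plain PHP array. A minimal sketch of loading such a list follows. It assumes a file with one bad word per line, which is not the `pattern~replacement` format the snippet above parses, and `loadBadWords()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: reads one bad word per line and returns a lowercase list.
function loadBadWords(string $path): array
{
    // file() returns false on failure; the flags drop newlines and empty lines.
    $lines = @file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }

    $words = [];
    foreach ($lines as $line) {
        // Normalise case and whitespace so later comparisons are consistent.
        $word = strtolower(trim($line));
        if ($word !== '') {
            $words[] = $word;
        }
    }

    // Remove duplicates and reindex so the result is a clean list.
    return array_values(array_unique($words));
}
```

An unreadable or missing file yields an empty array here, so every page would pass the check; whether that is the right failure mode depends on how strict the filter should be.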


Thanks, Erisco!

I tried this, but it doesn't seem to work.

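function get_content($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Return the response as a string instead of printing it,
    // which removes the need for output buffering.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $string = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns false on failure; fall back to an empty string.
    return $string === false ? '' : $string;
}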


/* $toLoad is from Browse.php */

$sourceOfWebpage = get_content($toLoad);
$textOfWebpage = strip_tags($sourceOfWebpage);

/* array: Obtained by your filter.txt file */
// Open the filters file and filter all of the results.

$filename =   "filters.txt";
$badWords = @fopen($filename, 'r');

if ($badWords) {
  $array = explode("\n", fread($fp, filesize($filename)));

  foreach($array as $key => $val){
    list($before,$after) = split("~",$val);
    $input = preg_replace($before,$after,$input);
  }
}

/* float: Some decimal value */

$allowedBadWordsPercent = 0.30;
$numberOfWords = str_word_count($textOfWebpage);
$numberOfBadWords = 0;
str_ireplace($badWords, '', $sourceOfWebpage, $numberOfBadWords);

if ($numberOfBadWords != 0) {
    $badWordsPercent = $numberOfWords / $numberOfBadWords;
} else {
    $badWordsPercent = 0;
}

if ($badWordsPercent > $allowedBadWordsPercent) {
    echo 'This is a naughty webpage';
}


This is the rough idea of what I'd do. You could argue that using str_ireplace() purely for its count is a bit devious, though. I am not sure whether there is a more direct function that avoids busting out the regexp.

/* string: Obtained by CURL or similar */
$sourceOfWebpage;

$textOfWebpage = strip_tags($sourceOfWebpage);

/* array: Obtained by your filters.txt file */
$badWords;

/* float: Some decimal value */
$allowedBadWordsPercent = 0.30;

$numberOfWords = str_word_count($textOfWebpage);
$numberOfBadWords = 0;

/* Count matches in the stripped text so markup is not matched; only the count is used. */
str_ireplace($badWords, '', $textOfWebpage, $numberOfBadWords);

/* Fraction of bad words: bad words divided by total words. */
if ($numberOfWords != 0) {
    $badWordsPercent = $numberOfBadWords / $numberOfWords;
} else {
    $badWordsPercent = 0;
}

if ($badWordsPercent > $allowedBadWordsPercent) {
    echo 'This is a naughty webpage';
}
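
One caveat with str_ireplace() is that it counts substring matches, so a short bad word also matches inside longer, harmless words. As a possible alternative that still avoids regular expressions, the sketch below splits the text into words and counts only whole-word matches. `badWordRatio()` is a hypothetical helper name, and the sample words are placeholders rather than a real list.

```php
<?php
// Hypothetical helper: fraction of words in $text that appear in $badWords.
function badWordRatio(string $text, array $badWords): float
{
    // Format 1 makes str_word_count() return the words instead of a count.
    $words = str_word_count(strtolower($text), 1);
    if (count($words) === 0) {
        return 0.0; // avoid dividing by zero on an empty page
    }

    // Flip the list so each lookup is a hash check rather than a scan.
    $lookup = array_flip(array_map('strtolower', $badWords));

    $bad = 0;
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $bad++;
        }
    }

    return $bad / count($words);
}

// Illustrative values only; "darn" and "heck" stand in for a real list.
$ratio = badWordRatio(strip_tags('<p>Well darn, what the heck</p>'), ['darn', 'heck']);
$blocked = $ratio > 0.30; // 2 of 5 words match, so this page would be blocked
```

To block a page on a single match, as the question also describes, the check would be `$ratio > 0` instead. Note that str_ireplace() and this helper both expect the bad words as an array of strings, so the list has to be read from the file first; passing the file handle returned by fopen() will not work. str_word_count() is locale dependent and geared towards ASCII letters, so pages in other scripts may need a different tokenizer.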