Paragraph comparison in PHP

Developer https://www.devze.com 2023-02-05 04:13 Source: web
I was wondering... let's say I have a webpage that crawls articles from the web. All I get is the title and the article in plain text. Is there a PHP script or web service that can relate articles to each other? Or is there a PHP script that can generate keywords from a paragraph?

I have tested a script in Java that works, but maybe there's a PHP class somewhere that can help...

thanks!


The functions from this answer (extractWords() and countKeywords()) can be used to extract words from text and compare them against each other. A rough example:

// For better results grab the texts manually and paste them here.
$nyt = file_get_contents('http://www.nytimes.com/2011/01/19/technology/19apple.html?pagewanted=print');
$sfc = file_get_contents('http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/19/BUAK1HARUL.DTL&type=business');

$nyt = strip_tags($nyt);
$sfc = strip_tags($sfc);

// stopwords from english snowball porter stemmer
$stopwordsFile = dirname(__FILE__).'/includes/stopwords_en.txt';
if (file_exists($stopwordsFile)) {
    $stopwords = file($stopwordsFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
} else {
    $stopwords = array();
}

$nytWords = extractWords($nyt, 3, $stopwords);
$sfcWords = extractWords($sfc, 3, $stopwords);

$nyt2sfcCount = countKeywords($nytWords, $sfcWords, 4);
$sfc2nytCount = countKeywords($sfcWords, $nytWords, 4);

// absolute
print_r($nyt2sfcCount);
print_r($sfc2nytCount);

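// length ratios, used to scale the counts of one text to the length of the other
// (note: both divisions fail if a download returned an empty string or false)
$nyt2sfcFactor = strlen($sfc) / strlen($nyt);
$sfc2nytFactor = strlen($nyt) / strlen($sfc);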

print($nyt2sfcFactor . PHP_EOL);
print($sfc2nytFactor . PHP_EOL);

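// initialise the result arrays so they exist even when no keywords were counted
$nyt2sfcCountRel = array();
foreach ($nyt2sfcCount as $word => $count) {
    $nyt2sfcCountRel[$word] = $count * $nyt2sfcFactor;
}

$sfc2nytCountRel = array();
foreach ($sfc2nytCount as $word => $count) {
    $sfc2nytCountRel[$word] = $count * $sfc2nytFactor;
}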

// relative: each scaled count is printed next to the other text's absolute count
print_r($nyt2sfcCountRel);
print_r($sfc2nytCount);
print_r($nyt2sfcCount);
print_r($sfc2nytCountRel);

// reduce
$nyt2sfcCountRed = array_intersect_key($nyt2sfcCount, $sfc2nytCount);
$sfc2nytCountRed = array_intersect_key($sfc2nytCount, $nyt2sfcCount);

// reduced absolute
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRed);

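// initialise the result arrays so they exist even when the intersection is empty
$nyt2sfcCountRedRel = array();
foreach ($nyt2sfcCountRed as $word => $count) {
    $nyt2sfcCountRedRel[$word] = $count * $nyt2sfcFactor;
}

$sfc2nytCountRedRel = array();
foreach ($sfc2nytCountRed as $word => $count) {
    $sfc2nytCountRedRel[$word] = $count * $sfc2nytFactor;
}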

// reduced relative: same pairing as above, limited to the keywords both texts share
print_r($nyt2sfcCountRedRel);
print_r($sfc2nytCountRed);
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRedRel);
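
The example above calls extractWords() and countKeywords(), which come from the linked answer and are not reproduced here. In case the link is unavailable, here is a minimal sketch of what such helpers might look like. The signatures and the meaning of each parameter are assumptions inferred from the call sites (minimum word length, stopword list, minimum count); the original functions may well behave differently. The tokenizer also assumes plain English text.

```php
<?php
// Hypothetical stand-ins for the helpers used in the example above.
// Parameter meanings are guesses based on how the functions are called.

/**
 * Split text into lowercase words, dropping words shorter than
 * $minLength and any word listed in $stopwords.
 */
function extractWords($text, $minLength = 3, array $stopwords = array())
{
    // Assumption: English text, so anything outside a-z, 0-9 and ' is a separator.
    $tokens = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $stopLookup = array_flip($stopwords);

    $words = array();
    foreach ($tokens as $token) {
        if (strlen($token) >= $minLength && !isset($stopLookup[$token])) {
            $words[] = $token;
        }
    }
    return $words;
}

/**
 * Count how often each word of $words occurs, considering only words
 * that also appear in $reference, and keep the ones that occur at
 * least $minCount times. The result is sorted by count, highest first.
 */
function countKeywords(array $words, array $reference, $minCount = 1)
{
    $inReference = array_flip($reference);

    $counts = array();
    foreach ($words as $word) {
        if (isset($inReference[$word])) {
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }

    $counts = array_filter($counts, function ($count) use ($minCount) {
        return $count >= $minCount;
    });
    arsort($counts);
    return $counts;
}

// Usage with two short made-up texts:
$stop = array('the', 'and');
$a = extractWords('Apple reports record profit. Apple shares rise as profit grows.', 3, $stop);
$b = extractWords('Record profit lifts Apple shares; analysts expected lower profit.', 3, $stop);
print_r(countKeywords($a, $b, 2));
```

With helpers along these lines, the keys that survive array_intersect_key() in the example are the words that are frequent in both articles, which gives a rough indication of whether two articles are related.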