I was wondering if anyone knows the best way to pull out the top recurring keywords/phrases from a block of text in PHP.
I want to build my own tag cloud for an application I'm working on. The main tricky part would be pulling out multi-word keywords such as "White House" and recognising them as one phrase rather than two separate words.
There must be a bunch of scripts out there for this purpose, but I just can't seem to find any!
Appreciate your help!
Here's a little chunk I used - it parses a comma-delimited string, and prints the size accordingly:
PHP
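function cs_get_tag_cloud_data($data)
{
    // Strips every space, so a multi-word tag has to be joined some other way (e.g. with a dash).
    $data = str_replace(' ', '', $data);
    $tagwords_arr = explode(",", $data);
    // Start with an empty array rather than null so sizeof() works on it in PHP 8+.
    $tags_arr = array();
    for ($x = 0; $x < sizeof($tagwords_arr); $x++)
    {
        $word_count = get_tag_count($tagwords_arr, $tagwords_arr[$x]);
        // Add each tag only once (the check is case-insensitive).
        if (in_tag_array($tags_arr, $tagwords_arr[$x]) == false)
        {
            $tags_arr[] = array("tag" => $tagwords_arr[$x], "count" => $word_count);
        }
    }
    return $tags_arr;
}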
# Get tag count
function get_tag_count($arr, $word)
{
    $wordCount = 0;
    for ($i = 0; $i < sizeof($arr); $i++)
    {
        // Case-insensitive match.
        if (strtoupper($arr[$i]) == strtoupper($word)) $wordCount++;
    }
    return $wordCount;
}
# Check if word already exists
function in_tag_array($arr, $search)
{
    // An empty (or not yet initialised) list cannot contain the tag.
    if (empty($arr))
    {
        return false;
    }
    foreach ($arr as $entry)
    {
        // Case-insensitive match on the stored tag name.
        if (strtoupper($entry['tag']) == strtoupper($search))
        {
            return true;
        }
    }
    return false;
}
HTML
<p id="tag-words">
<?php
$tag_data = cs_get_tag_cloud_data($cloud_data);
asort($tag_data);
// foreach follows the sorted order; an index-based for loop would ignore it.
foreach ($tag_data as $tag)
{
    $word = $tag["tag"];
    // Cap the count so very frequent tags do not get oversized.
    $count = min($tag["count"], 10);
    if ($count > 0)
    {
        $font_size = 8;
        $new_font_size = $font_size + ($count * 3);
        // Strip HTML entities from the tag before printing it.
        $word = preg_replace("/&#?[a-z0-9]+;/i", "", $word);
        echo '<a class="tag-link" style="font-size: ' . $new_font_size . 'px;" href="#">' . $word . '</a> ';
    }
}
?>
</p>
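To give a feel for the sizes this produces, here is the sizing rule on its own. It's only a sketch: `tag_font_size` is a made-up helper name, not part of the code above, and the 8px base, 3px step and cap of 10 are just the values used in the snippet.

```php
<?php
// Hypothetical helper: same arithmetic as the display loop above.
function tag_font_size($count, $base = 8, $step = 3, $cap = 10)
{
    // Cap the count so one very common tag cannot grow without limit.
    $count = min($count, $cap);
    return $base + ($count * $step);
}

echo tag_font_size(1) . "px\n";   // 11px
echo tag_font_size(4) . "px\n";   // 20px
echo tag_font_size(25) . "px\n";  // 38px (capped at 10 occurrences)
```

So every tag ends up between 11px and 38px, however often it appears.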
It's just something I've used, but I thought I'd share it; maybe it helps you.
Edit: For two-word tags, you could just do something like "White-House" and then remove the dash when you're echoing. Just another thought.
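A rough sketch of that dash idea, assuming the tags are stored as "White-House" in the comma-delimited string. `tag_counts` and `tag_label` are made-up names for illustration, and the counting here uses PHP's built-in `array_count_values` rather than the loops above, so treat it as one possible way to do it, not the only one.

```php
<?php
// Hypothetical helper: count tags case-insensitively from a comma-delimited string.
function tag_counts($data)
{
    // Same normalisation as above: drop spaces, then split on commas.
    $tags = explode(",", str_replace(' ', '', $data));
    // Lower-case so "White-House" and "white-house" count as one tag.
    return array_count_values(array_map('strtolower', $tags));
}

// Hypothetical helper: turn the stored form back into a readable label.
function tag_label($tag)
{
    return str_replace('-', ' ', $tag);
}

foreach (tag_counts("White-House, php, white-house, PHP, tags") as $tag => $count)
{
    echo tag_label($tag) . ": " . $count . "\n";
}
```

One thing to watch: a tag that really is hyphenated (say "e-mail") would lose its dash too, so you might want a different joining character if that matters to you.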