I need help to create a script for finding keywords in a string, and inserting them into a database for use in a tag cloud.
- The script would obviously need to discard special characters and common words like 'I', 'at', 'and', etc.
- Get a value for the frequency of each keyword it finds, then insert it into the database if it's new, or update the existing row by adding the string's keyword count.
- The string is unformatted text from a database row.
I'm not new to PHP, but I haven't attempted anything like this before, so any help is appreciated.
Thanks, Lea
Google + php keywords from text
= http://www.hashbangcode.com/blog/extract-keywords-text-string-php-412.html
Well, the answer is already there, but I'll still post my code for the little work that has gone into it.
I think that a MySQL db is not ideal for storing this kind of data. I would suggest something like MemcacheDB, so you can easily access a keyword by using it as an index to fetch the count from the db. Persisting those keywords in a high-load environment may cause problems with a MySQL db.
// $text is assumed to hold the unformatted text fetched from the database row
$keyWords = extractKeyWords($text);
saveWords($keyWords);

// returns an array of base64-encoded lowercase word => number of occurrences
function extractKeyWords($text) {
    $result = array();
    if (preg_match_all('#([\w]+)\b#i', $text, $matches)) {
        foreach ($matches[1] as $match) {
            // encode found word to safely use as key in array
            $encodedKey = base64_encode(strtolower($match));
            if (wordIsValid($match)) {
                if (array_key_exists($encodedKey, $result)) {
                    $result[$encodedKey]++;
                } else {
                    $result[$encodedKey] = 1;
                }
            }
        }
    }
    return $result;
}

// a word is valid if it is longer than one character and not on the ignore list
function wordIsValid($word) {
    $wordsToIgnore = array("to", "and", "if", "or", "by", "me", "you", "it", "as", "be", "the", "in");
    // don't use words with a single character
    if (strlen($word) <= 1) {
        return false;
    }
    return !in_array(strtolower($word), $wordsToIgnore);
}

// not implemented yet ;) for now this only prints each word with its count
function saveWords($arrayOfWords) {
    foreach ($arrayOfWords as $word => $count) {
        echo base64_decode($word) . ":" . $count . "\n";
    }
}
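Since saveWords() is left unimplemented above, here is one way the "insert if new, otherwise add to the existing count" step from the question might be sketched with PDO. This is only a sketch under assumptions: the `keywords` table, its `keyword` and `frequency` columns, and the function name `saveWordCounts` are all hypothetical and not from the original posts. It expects the base64-encoded keys that extractKeyWords() produces.

```php
<?php
// Hypothetical table (an assumption, not from the original post):
//   CREATE TABLE keywords (keyword VARCHAR(64) PRIMARY KEY, frequency INT NOT NULL)

// Adds each word's count to its existing row, or inserts a new row if none exists.
// $arrayOfWords maps base64-encoded words to their counts.
function saveWordCounts(PDO $pdo, array $arrayOfWords) {
    $update = $pdo->prepare(
        'UPDATE keywords SET frequency = frequency + :freq WHERE keyword = :keyword'
    );
    $insert = $pdo->prepare(
        'INSERT INTO keywords (keyword, frequency) VALUES (:keyword, :freq)'
    );

    $pdo->beginTransaction();
    try {
        foreach ($arrayOfWords as $encodedWord => $count) {
            $params = array(
                ':keyword' => base64_decode($encodedWord),
                ':freq'    => $count,
            );
            // Try to update first; if no row was touched, the keyword is new.
            $update->execute($params);
            if ($update->rowCount() === 0) {
                $insert->execute($params);
            }
        }
        $pdo->commit();
    } catch (Exception $e) {
        // Undo partial writes and let the caller see the error.
        $pdo->rollBack();
        throw $e;
    }
}
```

The update-then-insert pattern is used here because it is portable across databases, but two concurrent requests could both try to insert the same new keyword. On MySQL specifically, a single `INSERT ... ON DUPLICATE KEY UPDATE` statement would likely be the more robust choice.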
You could approach this with a dictionary of keywords or a dictionary of words to ignore. If you make a dictionary of keywords, count each time one is used and then update a database table with those counts. If you make a dictionary of words to ignore, strip those words from posts and insert or update a count for all the remaining words in the keyword table.
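To make the two options concrete, here is a minimal sketch of both. The function names `tokenize`, `countFromKeywordDictionary`, and `countExcludingIgnored` are hypothetical, and the tokenizer assumes plain ASCII/English text; both dictionaries are expected to be lowercase.

```php
<?php
// Split text into lowercase words; apostrophes are kept so "don't" stays whole.
// Assumes ASCII text: any other character is treated as a separator.
function tokenize($text) {
    return preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Approach 1: count only the words that appear in a dictionary of keywords.
function countFromKeywordDictionary($text, array $keywords) {
    return array_count_values(array_intersect(tokenize($text), $keywords));
}

// Approach 2: count every word that is NOT in a dictionary of words to ignore.
function countExcludingIgnored($text, array $wordsToIgnore) {
    return array_count_values(array_diff(tokenize($text), $wordsToIgnore));
}

// Example usage with made-up dictionaries:
$text = 'I at and Cubs PHP Cubs';
print_r(countFromKeywordDictionary($text, array('php', 'mysql')));
print_r(countExcludingIgnored($text, array('i', 'at', 'and')));
```

The keyword dictionary gives tighter control over which tags can ever appear, while the ignore list picks up new terms automatically at the cost of letting some noise through.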
The way does it is by storing every word entered in every post in a table. When people search the forum, the result is the post IDs from which the words came. I suggest something like this.
Compare a user submission with your array of blacklisted (obvious) words, which would come from a database table. The words that survive are your keywords. Enter those keywords into your database table. Then use a SELECT * statement on your table to return a result set. Use the array_count_values function as demonstrated to get your count.
Perhaps a better way is to do what most sites do and force the user to enter their keywords (Stack Overflow, Delicious, etc.). That way you can skip all the parsing up front.
If the string is not too long and you won't have memory issues with storing the string in arrays, how about this?
# string to parse, comes from the database as you suggested
$string = 'I at and Cubs PHP Cubs';
# string is now an array
$stringArray = explode(" ", $string);
# list of "obvious" words to exclude, this would probably come from a database table
$wordsToExclude = array('I', 'at', 'and');
# array that contains your "keywords"
# Array('Cubs', 'PHP', 'Cubs')
$keywordArray = array_diff($stringArray, $wordsToExclude);
# array with the keyword as the key and the count as the value
# Array('Cubs' => 2, 'PHP' => 1)
$countedValues = array_count_values($keywordArray);
Now you need to "search" the database for the keys in the $countedValues array. What does your table look like?
Or of course you could avoid reinventing the wheel and Google "php tag cloud"...
Reference: PHP array functions