Where can I find algorithms that values the spelling of misplaced characters more accurately than levenshtein() and php similar_text() methods?
Example:
similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // retur开发者_StackOverflow中文版ns 4 <- although more similar!
/ Jonas
Here's a solution that I've come up to. It's based on Tim's suggestion of comparing the order of subsequent charachters. Some results:
- jonas / jonax : 0.8
- jonas / sjona : 0.68
- jonas / sjonas : 0.66
- jonas / asjon : 0.52
- jonas / xxjon : 0.36
I'm sure i isn't perfect, and that it could be optimized, but nevertheless it seems to produce the results that I'm after... One weak spot is that when strings have different length, it produces different result when the values are swapped...
static public function string_compare($str_a, $str_b)
{
$length = strlen($str_a);
$length_b = strlen($str_b);
$i = 0;
$segmentcount = 0;
$segmentsinfo = array();
$segment = '';
while ($i < $length)
{
$char = substr($str_a, $i, 1);
if (strpos($str_b, $char) !== FALSE)
{
$segment = $segment.$char;
if (strpos($str_b, $segment) !== FALSE)
{
$segmentpos_a = $i - strlen($segment) + 1;
$segmentpos_b = strpos($str_b, $segment);
$positiondiff = abs($segmentpos_a - $segmentpos_b);
$posfactor = ($length - $positiondiff) / $length_b; // <-- ?
$lengthfactor = strlen($segment)/$length;
$segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
}
else
{
$segment = '';
$i--;
$segmentcount++;
}
}
else
{
$segment = '';
$segmentcount++;
}
$i++;
}
// PHP 5.3 lambda in array_map
$totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
return $totalscore;
}
In addition to levenshtein() and similar_text(), there's also:
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
Please, be careful about using string_compare :
ivanov ivan / ivanov ivan : 1 OK!
ivanov ivan2 / ivanov ivan : 1 o_O
ivanov ivan / ivanov i : 1.1363636363636 OMG!
I've found that Jaro Winkler is also good for spelling mistakes and small differences between strings. I modified this code to be object-oriented:
class StringCompareJaroWinkler
{
public function compare($str1, $str2)
{
return $this->JaroWinkler($str1, $str2, $PREFIXSCALE = 0.1 );
}
private function getCommonCharacters( $string1, $string2, $allowedDistance ){
$str1_len = mb_strlen($string1);
$str2_len = mb_strlen($string2);
$temp_string2 = str_split($string2);
$commonCharacters='';
for( $i=0; $i < $str1_len; $i++){
$noMatch = True;
// compare if char does match inside given allowedDistance
// and if it does add it to commonCharacters
for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++){
if( $temp_string2[$j] == $string1[$i] ){
$noMatch = False;
$commonCharacters .= $string1[$i];
$temp_string2[$j] = '';
}
}
}
return $commonCharacters;
}
private function Jaro( $string1, $string2 ){
$str1_len = mb_strlen( $string1 );
$str2_len = mb_strlen( $string2 );
// theoretical distance
$distance = (int) floor(min( $str1_len, $str2_len ) / 2.0);
// get common characters
$commons1 = $this->getCommonCharacters( $string1, $string2, $distance );
$commons2 = $this->getCommonCharacters( $string2, $string1, $distance );
if( ($commons1_len = mb_strlen( $commons1 )) == 0) return 0;
if( ($commons2_len = mb_strlen( $commons2 )) == 0) return 0;
// calculate transpositions
$transpositions = 0;
$upperBound = min( $commons1_len, $commons2_len );
for( $i = 0; $i < $upperBound; $i++){
if( $commons1[$i] != $commons2[$i] ) $transpositions++;
}
$transpositions /= 2.0;
// return the Jaro distance
return ($commons1_len/($str1_len) + $commons2_len/($str2_len) + ($commons1_len - $transpositions)/($commons1_len)) / 3.0;
}
private function getPrefixLength( $string1, $string2, $MINPREFIXLENGTH = 4 ){
$n = min( array( $MINPREFIXLENGTH, mb_strlen($string1), mb_strlen($string2) ) );
for($i = 0; $i < $n; $i++){
if( $string1[$i] != $string2[$i] ){
// return index of first occurrence of different characters
return $i;
}
}
// first n characters are the same
return $n;
}
private function JaroWinkler($string1, $string2, $PREFIXSCALE = 0.1 ){
$JaroDistance = $this->Jaro( $string1, $string2 );
$prefixLength = $this->getPrefixLength( $string1, $string2 );
return $JaroDistance + $prefixLength * $PREFIXSCALE * (1.0 - $JaroDistance);
}
}
$jw = new StringCompareJaroWinkler();
echo $jw->compare("jonas","asjon");
@Tim: I'm actually looking for a way to process/measure similarities in a pedagogical game context. Let's say that a student's task is to select objects from a pool, and put those objects in a specific order (sort them by alphabet or whatever). I then need a way to measure the similarity between the students answer and the correct one
Algorithms to calculate the degree-of-correctness of the order of characters in a word (i.e. its spelling) could be very different from an algorithm to measure the correct order of words in a list. The way spelling algorithms handle omissions or dittography or transpositions might not apply very well to your use case.
If you know the order of elements in advance, and know the number of elements too, then you could simply loop through the answer and compare value-at-position to correct-value-at-position and arrive at a percentage-correct. Yet that would be a crude measure, and misleading, for if the goal of your game was to test, say, whether the gamer understood alphabetic sorting, and the gamer happened to get the first word wrong, every word could be in the wrong position even if the words were in otherwise correct alphabetic order:
banana
blackberry
blueberry
cherry
fig
grapefruit
orange
pear
persimmon
raspberry
apple
So what you could do to improve the accuracy of your measurement in our hypothetical situation is this: loop through the gamer's answer-list looking to see if the answer value is immediately followed by the correct word; every time a word is followed by the correct word, you would give the gamer a point. The gamer who produced the list above would get 9 points out of a possible 10 and that score would indeed accurately reflect the gamer's understanding of the rules of alphabetic sorting.
精彩评论