开发者

How to sort an array by similarity in relation to an inputted word.

开发者 https://www.devze.com 2023-03-31 00:50 出处:网络
I have on PHP array, for example: $arr = array(\"hello\", \"try\", \"hel\", \"hey hello\"); Now I want to do rearrange of the array which will be based on the most nearly clo开发者_JS百科se words b

I have on PHP array, for example:

$arr = array("hello", "try", "hel", "hey hello");

Now I want to do rearrange of the array which will be based on the most nearly clo开发者_JS百科se words between the array and my $search var.

How can I do that?


This is a quick solution by using http://php.net/manual/en/function.similar-text.php:

This calculates the similarity between two strings as described in Programming Classics: Implementing the World's Best Algorithms by Oliver (ISBN 0-131-00413-1). Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.

    $input = 'Bradley123';

    $list = array('Bob', 'Brad', 'Britney');

    usort($list, fn($a, $b) => similar_text($input, $a) <=> similar_text($input, $b));
    
    var_dump($list); //output: array("Brad", "Britney", "Bob");

Or using http://php.net/manual/en/function.levenshtein.php:

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the length of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).

    $input = 'Bradley123';

    $list = array('Bob', 'Brad', 'Britney');

    usort($list, fn($a, $b) => levenshtein($input, $a) <=> levenshtein($input, $b));

    var_dump($list); //output: array("Britney", "Brad", "Bob");


You can use levenshtein function

<?php
// input misspelled word
$input = 'helllo';

// array of words to check against
$words  = array('hello' 'try', 'hel', 'hey hello');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // check for an exact match
    if ($lev == 0) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}

?>


if you want to sort your array, you can do this:

$arr = array("hello", "try", "hel", "hey hello");
$search = "hey"; //your search var

for($i=0; $i<count($arr); $i++) {
   $temp_arr[$i] = levenshtein($search, $arr[$i]);
}
asort($temp_arr);
foreach($temp_arr as $k => $v) {
    $sorted_arr[] = $arr[$k];
}

$sorted_arr should then be in descending order starting with the closest word to your search term.


While @yceruto's answer is correct and informative, I would like to extend additional insights and demonstrate more modern implementation syntax.

  • The three-way comparison operator (aka "spaceship operator") <=> from PHP7+
  • Arrow function syntax to allow extra variables into the custom function scope from PHP7.4+.

First about the generated scores from respective functions...

  1. levenshtein() and similar_text() ARE case-sensitive so an uppercase H is just as much a mismatch as the number 6 when compared to h.
  2. levenshtein() and similar_text() ARE NOT multi-byte aware so an accented character like ê will not only be deemed a mismatch for e, it will potentially receive a heavier penalty based on each individual byte being a mismatch.

If you want to make case-insensitive comparisons, you can simply convert both strings to uppercase/lowercase before executing.

If your application requires multi-byte support, you should search for existing repositories that provide this functionality.

Additional techniques for those willing to research more deeply include metaphone() and soundex(), but I will not delve into these topics in this answer.

Scores:

Test vs "hello" |  levenshtein   |  similar_text  |   similar_text's percent   |
----------------+----------------+----------------+----------------------------|
H3||0           |       5        |      0         |       0                    |
Hallo           |       2        |      3         |      60                    |
aloha           |       5        |      2         |      40                    |
h               |       4        |      1         |      33.333333333333       |
hallo           |       1        |      4         |      80                    |
hallå           |       3        |      3         |      54.545454545455       |
hel             |       2        |      3         |      75                    |
helicopter      |       6        |      4         |      53.333333333333       |
hellacious      |       5        |      5         |      66.666666666667       |
hello           |       0        |      5         |     100                    |
hello y'all     |       6        |      5         |      62.5                  |
hello yall      |       5        |      5         |      66.666666666667       |
helów           |       3        |      3         |      54.545454545455       |
hey hello       |       4        |      5         |      71.428571428571       |
hola            |       3        |      2         |      44.444444444444       |
hêllo           |       2        |      4         |      72.727272727273       |
mellow yellow   |       9        |      4         |      44.444444444444       |
try             |       5        |      0         |       0                    |

Sort by levenshtein() PHP7+ (Demo)

usort($testStrings, function($a, $b) use ($needle) {
    return levenshtein($needle, $a) <=> levenshtein($needle, $b);
});

Sort by levenshtein() PHP7.4+ (Demo)

usort($testStrings, fn($a, $b) => levenshtein($needle, $a) <=> levenshtein($needle, $b));

**Notice that $a and $b have changed sides of the <=> evaluation for DESC ordering. Notice that hello is not assured to be positioned as first element

Sort by similar_text() PHP7+ (Demo)

usort($testStrings, function($a, $b) use ($needle) {
    return similar_text($needle, $b) <=> similar_text($needle, $a);
});

Sort by similar_text() PHP7.4+ (Demo)

usort($testStrings, fn($a, $b) => similar_text($needle, $b) <=> similar_text($needle, $a));

Notice the difference in scoring of hallå and helicopter via similar_text()'s return value versus similar_text()'s percent value.

Sort by similar_text()'s percent PHP7+ (Demo)

usort($testStrings, function($a, $b) use ($needle) {
    similar_text($needle, $a, $percentA);
    similar_text($needle, $b, $percentB);
    return $percentB <=> $percentA;
});

Sort by similar_text()'s percent PHP7.4+ (Demo)

usort($testStrings, fn($a, $b) => 
    [is_int(similar_text($needle, $b, $percentB)), $percentB]
    <=>
    [is_int(similar_text($needle, $a, $percentA)), $percentA]
);

Notice that I am neutralizing the unwanted return value of similar_text() by converting its return value to true, then using the generated percent value -- this allows the generation of the percent value without returning too soon since arrow function syntax does not permit multi-line execution.


Efficiently sort by levenshtein() then only call similar_text() when a tiebreak is necessary, PHP7+ (Demo)

usort($testStrings, function($a, $b) use ($needle) {
    return levenshtein($needle, $a) <=> levenshtein($needle, $b)
           ?: similar_text($needle, $b) <=> similar_text($needle, $a);
});

Efficiently sort by levenshtein() then only call similar_text() and use its percent when a tiebreak is necessary, PHP7.4+ (Demo)

usort($testStrings, fn($a, $b) =>
    levenshtein($needle, $a) <=> levenshtein($needle, $b)
    ?: similar_text($needle, $b) <=> similar_text($needle, $a)
);

Personally, I never use anything but levenshtein() in my projects because it consistently delivers the results that I'm looking for.


To reduce total function calls and improve performance all of these approaches can be transferred to an array_multisort() implementation. You only need to build flat arrays of evaluations, then add those arrays as arguments, then the last argument should be the original array.


Another way is to use similar_text function which returns result in percents. See more http://www.php.net/manual/en/function.similar-text.php .

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号