开发者

How to match rows in array to an array of masks?

开发者 https://www.devze.com 2023-04-05 23:26 出处:网络
I have array like this: array(\'1224*\', \'543*\', \'321*\' ...) which contains about 17,00 "masks" or prefixes.

I have array like this:

array('1224*', '543*', '321*' ...) which contains about 17,00 "masks" or prefixes.

I have a sec开发者_如何学Pythonond array:

array('123456789', '123456788', '987654321' ....) which contain about 250,000 numbers.

Now, how can I efficiently match every number from the second array using the array of masks/prefixes?

[EDIT]

The first array contains only prefixes and every entry has only one * at the end.


Well, here's a solution:

Prelimary steps:

  1. Sort array 1, cutting off the *'s.

Searching:

  1. For each number in array 2 do
    1. Find the first and last entry in array 1 of which the first character matches that of number (binary search).
    2. Do the same for the second character, this time searching not the whole array but between first and last (binary search).
    3. Repeat 2 for the nth character until a string is found.

This should be O(k*n*log(n)) where n is the average number length (in digits) and k the number of numbers.


Basically this is a 1 dimensional Radix tree, for optimal performance you should implement it, but it can be quite hard.


My two cents....

$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');

foreach ($f as $haystack) {
    echo $haystack."<br>";
    foreach ($s as $needle) {
        $needle = str_replace("*","",$needle);
        echo $haystack "- ".$needle.": ".startsWith($haystack, $needle)."<br>";
    }
}

function startsWith($haystack, $needle) {
    $length = strlen($needle);
    return (substr($haystack, 0, $length) === $needle);
}

To improve performance it might be a good idea to sort both arrays first and to add an exit clause in the inner foreach loop.

By the way, the startWith-function is from this great solution in SO: startsWith() and endsWith() functions in PHP


Another option would to be use preg_grep in a loop:

$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);

$matches = array();
foreach($masks as $mask) {
    $mask = substr($mask, 0, strlen($masks) - 2); // strip off trailing *
    $matches[$mask] = preg_grep("/^$mask/", $data);
}

No idea how efficient this would be, just offering it up as an alternative.


Although regex is not famous for being fast, I'd like to know how well preg_grep() can perform if the pattern is boiled down to its leanest form and only called once (not in a loop).

By removing longer masks which are covered by shorter masks, the pattern will be greatly reduced. How much will the reduction be? of course, I cannot say for sure, but with 17,000 masks, there are sure to be a fair amount of redundancy.

Code: (Demo)

$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
    if (strpos($mask, $needle) !== 0) {
        $needle = rtrim($mask, '*');
        $keep[] = $needle;
    }
}
// now $keep only contains: ['1224', '321', '5']

$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];

var_export(
    preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);

Output:

array (
  0 => '122456789',
  2 => '321876543234567',
  3 => '55555555555555555',
)


Check out the PHP function array_intersect_key.

0

精彩评论

暂无评论...
验证码 换一张
取 消