I have array like this:
array('1224*', '543*', '321*' ...)
which contains about 17,00 "masks" or prefixes.
I have a sec开发者_如何学Pythonond array:
array('123456789', '123456788', '987654321' ....)
which contain about 250,000 numbers.
Now, how can I efficiently match every number from the second array using the array of masks/prefixes?
[EDIT]
The first array contains only prefixes and every entry has only one *
at the end.
Well, here's a solution:
Prelimary steps:
- Sort array 1, cutting off the
*
's.
Searching:
- For each number in array 2 do
- Find the first and last entry in array 1 of which the first character matches that of
number
(binary search). - Do the same for the second character, this time searching not the whole array but between
first
andlast
(binary search). - Repeat 2 for the nth character until a string is found.
- Find the first and last entry in array 1 of which the first character matches that of
This should be O(k*n*log(n))
where n
is the average number length (in digits) and k
the number of numbers.
Basically this is a 1 dimensional Radix tree, for optimal performance you should implement it, but it can be quite hard.
My two cents....
$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');
foreach ($f as $haystack) {
echo $haystack."<br>";
foreach ($s as $needle) {
$needle = str_replace("*","",$needle);
echo $haystack "- ".$needle.": ".startsWith($haystack, $needle)."<br>";
}
}
function startsWith($haystack, $needle) {
$length = strlen($needle);
return (substr($haystack, 0, $length) === $needle);
}
To improve performance it might be a good idea to sort both arrays first and to add an exit clause in the inner foreach
loop.
By the way, the startWith
-function is from this great solution in SO: startsWith() and endsWith() functions in PHP
Another option would to be use preg_grep in a loop:
$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);
$matches = array();
foreach($masks as $mask) {
$mask = substr($mask, 0, strlen($masks) - 2); // strip off trailing *
$matches[$mask] = preg_grep("/^$mask/", $data);
}
No idea how efficient this would be, just offering it up as an alternative.
Although regex is not famous for being fast, I'd like to know how well preg_grep()
can perform if the pattern is boiled down to its leanest form and only called once (not in a loop).
By removing longer masks which are covered by shorter masks, the pattern will be greatly reduced. How much will the reduction be? of course, I cannot say for sure, but with 17,000 masks, there are sure to be a fair amount of redundancy.
Code: (Demo)
$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
if (strpos($mask, $needle) !== 0) {
$needle = rtrim($mask, '*');
$keep[] = $needle;
}
}
// now $keep only contains: ['1224', '321', '5']
$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];
var_export(
preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);
Output:
array (
0 => '122456789',
2 => '321876543234567',
3 => '55555555555555555',
)
Check out the PHP function array_intersect_key.
精彩评论