How can I count the number of words between two words?
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
cat lime mac tükörfúrógép cat orange lime cat árvíztűrő
tükörfúrógép banana orange lime
orange lime cat árvíztűrő";
The two words: 'árvíztűrő' and 'tükörfúrógép'
I need this to return: tükörfúrógép cherry árvíztűrő tükörfúrógép cat orange lime cat árvíztűrő tükörfúrógép banana orange lime orange lime cat árvíztűrő
Now I have this regular expression:
preg_match_all('@((tükörfúrógép(.*)?árvíztűrő)(árvíztűrő(.*)?tükörfúrógép))@sui',$txt,$m);
I have several things to point out:
- You can't do it in one regex. Regex is forward-only, reversed match order requires a second regex.
- You use (.*)?, but you mean (.*?)
- To acquire correct matches, you must ensure that the left boundary of your expression cannot occur in the middle.
- You should denote word boundaries (\b) around your delimiter words to ensure whole-word matches. (EDIT: While this is correct in theory, it does not work for Unicode input in PHP.)
- You should switch the PHP locale to Hungarian (it is Hungarian, right?) before calling preg_match_all(), because the locale has an influence on what's considered a word boundary in PHP. (EDIT: The meaning of \b does in fact not change with the selected locale, so this step does not help.)
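To illustrate the (.*)? versus (.*?) point from the list above, here is a minimal sketch. The sample string and its ASCII delimiter words "start" and "end" are made up for the demo and are not from the question.

```php
<?php
// Hypothetical sample text with plain ASCII delimiter words.
$sample = 'start a end b start c end';

// (.*)? is a greedy group that is merely optional: it still grabs as much as it can.
preg_match('/start(.*)?end/s', $sample, $greedy);

// (.*?) is a lazy group: it stops at the first "end" that lets the match succeed.
preg_match('/start(.*?)end/s', $sample, $lazy);

echo trim($greedy[1]), "\n"; // a end b start c
echo trim($lazy[1]), "\n";   // a
```

The greedy variant runs to the last "end" in the text, which is why the original expression swallows several delimiter pairs at once.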
That being said, regex #1 is:
(\btükörfúrógép\b)((?:(?!\1).)*?)\bárvíztűrő\b
and regex #2 is analogous, just with reversed delimiter words.
Regex explanation:
(                  # match group 1:
  \b               #   a word boundary
  tükörfúrógép     #   your first delimiter word
  \b               #   a word boundary
)                  # end match group 1
(                  # match group 2:
  (?:              #   non-capturing group:
    (?!            #     look-ahead:
      \1           #       must not be followed by delimiter word 1
    )              #     end look-ahead
    .              #     match any next char (includes \n with the "s" switch)
  )*?              #   end non-capturing group, repeat as often as necessary
)                  # end match group 2 (this is the one you look for)
\b                 # a word boundary
árvíztűrő          # your second delimiter word
\b                 # a word boundary
UPDATE: With PHP's poor Unicode string support, you will be forced to use expressions like these as replacements for \b:
$before = '(?<=^|[^\p{L}])';
$after = '(?=[^\p{L}]|$)';
This suggestion has been taken from another question.
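Putting the pieces together, the following is a rough sketch of how $before and $after could stand in for \b in regex #1, run once per direction. The helper name between_pattern() is hypothetical, and the sample text is a shortened version of the one in the question.

```php
<?php
// Unicode-aware replacements for \b, as suggested above.
$before = '(?<=^|[^\p{L}])';
$after  = '(?=[^\p{L}]|$)';

// Hypothetical helper: builds regex #1 for a given pair of delimiter words.
function between_pattern($left, $right, $before, $after)
{
    $l = preg_quote($left, '@');
    $r = preg_quote($right, '@');
    // "s": dot matches newlines, "u": pattern and subject are UTF-8, "i": ignore case.
    return '@(' . $before . $l . $after . ')((?:(?!\1).)*?)' . $before . $r . $after . '@sui';
}

// Shortened sample text (assumption: saved as UTF-8, like the rest of the file).
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
cat lime mac tükörfúrógép cat orange lime cat árvíztűrő";

// One pass per direction, because a single regex only scans forward.
preg_match_all(between_pattern('tükörfúrógép', 'árvíztűrő', $before, $after), $txt, $m1);
preg_match_all(between_pattern('árvíztűrő', 'tükörfúrógép', $before, $after), $txt, $m2);

$forward  = array_map('trim', $m1[2]); // text between tükörfúrógép ... árvíztűrő
$backward = array_map('trim', $m2[2]); // text between árvíztűrő ... tükörfúrógép

print_r($forward);
print_r($backward);
```

Group 2 of each match holds the text between the two words; counting the words in it is then a separate step.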
To count words between two words you can easily use:
count(explode(" ", "lime orange banana")); // split() was removed in PHP 7; explode() does the same job here
And a function that returns an array with matches and counts will be:
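function count_between_words($text, $first, $second, $case_sensitive = false)
{
    // Escape the delimiter words and enable UTF-8 mode ("u") for the accented input.
    $pattern = '/(' . preg_quote($first, '/') . ')((?:(?!\\1).)*?)' . preg_quote($second, '/') . '/su'
        . ($case_sensitive ? '' : 'i');
    // Collapse every run of whitespace into a single space before matching.
    $text = preg_replace('/\\s+/u', ' ', $text);
    if (!preg_match_all($pattern, $text, $results, PREG_SET_ORDER))
        return array();
    $data = array();
    foreach ($results as $result)
    {
        $words = trim($result[2]);
        // split() was removed in PHP 7; PREG_SPLIT_NO_EMPTY also gives 0 for an empty gap.
        $count = count(preg_split('/\\s+/u', $words, -1, PREG_SPLIT_NO_EMPTY));
        $data[] = array("match" => $result[0], "words" => $words, "count" => $count);
    }
    return $data;
}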
$result = count_between_words($txt, "tükörfúrógép", "árvíztűrő");
echo "<pre>" . print_r($result, true) . "</pre>";
Result will be:
Array
(
    [0] => Array
        (
            [match] => tükörfúrógép cherry árvíztűrő
            [words] => cherry
            [count] => 1
        )

    [1] => Array
        (
            [match] => tükörfúrógép cat orange lime cat árvíztűrő
            [words] => cat orange lime cat
            [count] => 4
        )

    [2] => Array
        (
            [match] => tükörfúrógép banana orange lime orange lime cat árvíztűrő
            [words] => banana orange lime orange lime cat
            [count] => 6
        )

)
Instead of a huge, confusing regexp, why not write a few lines using various string functions?
Example:
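$first = 'árvíztűrő';
$start = strpos($txt, $first) + strlen($first); // byte offset just past 'árvíztűrő' (strpos() counts bytes, not characters)
$end = strpos($txt, 'tükörfúrógép', $start);
$inner = substr($txt, $start, $end - $start);
$words = preg_split("/[\s,]+/u", $inner, -1, PREG_SPLIT_NO_EMPTY); // drop empty pieces from leading/trailing whitespace
$num = count($words);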
Of course, this will eat up memory if you have some gigantic input string...
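The snippet above only handles the first occurrence. A possible way to extend the same string-function idea to every occurrence is sketched below. The helper name counts_between() is hypothetical, and the sample text is a shortened version of the one in the question. It uses strlen() for the offset because strpos() works on bytes. Unlike the regex approach, it does not skip a repeated opening word inside the gap.

```php
<?php
// Hypothetical helper: word counts for every stretch from $left to the next $right.
function counts_between($txt, $left, $right)
{
    $counts = array();
    $offset = 0;
    while (($start = strpos($txt, $left, $offset)) !== false) {
        $start += strlen($left);                 // byte offset just past $left
        $end = strpos($txt, $right, $start);
        if ($end === false) {
            break;                               // no closing word left
        }
        $inner = substr($txt, $start, $end - $start);
        // Split on whitespace and commas, ignoring empty pieces.
        $counts[] = count(preg_split('/[\s,]+/u', $inner, -1, PREG_SPLIT_NO_EMPTY));
        $offset = $end;                          // continue from the closing word
    }
    return $counts;
}

// Shortened sample text (assumption: saved as UTF-8).
$txt = "tükörfúrógép banana orange lime, tükörfúrógép cherry árvíztűrő orange lyon
cat lime mac tükörfúrógép cat orange lime cat árvíztűrő";

print_r(counts_between($txt, 'árvíztűrő', 'tükörfúrógép'));
```

Calling it a second time with the two words swapped covers the other direction.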