RegEx to determine the longest "part" of a phrase, with specified delimiters?
News stories almost always have this sort of structure, where the <title> is actually the title plus a bunch of garbage. Is there a way to RegEx out all the garbage and keep the longest part of the title? Obviously this would require using delimiters such as |, -, :, etc.
Here are some examples:
eBand | Jornalismo | Saúde | Alimentos em conserva podem causar botulismo; saiba como evitar a doença
Obama calls for wide-range immigration reform in El Paso - San Jose Mercury News
CL + Suspensa produção de mortadela com toucinho, suspeita de contaminação
BBC News - John Kerry to travel to Pakistan amid strained ties
Not with the regex itself, I think. But you can split the title on the "garbage" characters and then sort the remaining parts by length:
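// Split on runs of delimiter characters, swallowing surrounding whitespace.
$parts = preg_split('#\s*[-|:+]+\s*#', $title);
// Map each part to its length (in bytes), then sort longest first.
$parts = array_combine($parts, array_map("strlen", $parts));
arsort($parts);
$longest = current(array_keys($parts));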
Instead of specific delimiters, you could also split on non-word symbols \W (or [^\pL] with the /u Unicode flag).
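One thing to watch with a bare `-` delimiter is that it also splits hyphenated words such as "wide-range", and strlen counts bytes rather than characters, which matters for titles like the Portuguese ones in the question. A minimal sketch of a variant that only treats delimiters as separators when they have whitespace on both sides, and that compares lengths with mb_strlen, could look like this. The function name `longestPart` is made up for illustration, and the code assumes the mbstring extension is available and the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the answers above): returns the longest
// part of a headline after splitting on delimiter runs.
function longestPart(string $title): string
{
    // Only split where the delimiters (| - : +) have whitespace on both
    // sides, so "wide-range" stays in one piece. preg_split returns false
    // on error (e.g. invalid UTF-8 with /u), hence the fallback to [].
    $parts = preg_split('/\s+[-|:+]+\s+/u', $title, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $longest = '';
    foreach ($parts as $part) {
        $part = trim($part);
        // mb_strlen counts characters; strlen would count bytes.
        if (mb_strlen($part, 'UTF-8') > mb_strlen($longest, 'UTF-8')) {
            $longest = $part;
        }
    }
    return $longest;
}

echo longestPart('Obama calls for wide-range immigration reform in El Paso - San Jose Mercury News'), "\n";
// prints: Obama calls for wide-range immigration reform in El Paso
```

The trade-off is that a title written as "Title: subtitle" (no space before the colon) is no longer split, so the pattern would need adjusting if that form is common in your data.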
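I don't think it can be done in pure regular expressions, but you can use preg_split and iterate over the results:
// The hyphen goes last in the character class so it is a literal, not a range.
// The flags are the fourth argument; the third is the limit (-1 = no limit).
$pieces = preg_split('/[|:-]/', $headline, -1, PREG_SPLIT_NO_EMPTY);
$max_len = 0;
$result = '';
foreach ($pieces as $piece) {
    $piece = trim($piece); // drop the spaces left around each delimiter
    $len = strlen($piece);
    if ($len > $max_len) {
        $max_len = $len;
        $result = $piece;
    }
}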
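Or use array_reduce:
// Returns whichever of the two strings is longer (the first one on a tie).
function longest($v, $w) {
    if (strlen($w) > strlen($v)) {
        return $w;
    }
    return $v;
}
// Hyphen last in the class so it is a literal; flags are the fourth argument.
$pieces = preg_split('/[|:-]/', $headline, -1, PREG_SPLIT_NO_EMPTY);
// Start from '' so longest() never receives null as its first argument.
$result = array_reduce($pieces, 'longest', '');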
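One detail worth double-checking in any preg_split call: the third argument is the limit and the fourth is the flags. PREG_SPLIT_NO_EMPTY has the value 1, so passing it in the third position is read as "return at most one piece" and the string comes back unsplit. A short sketch of the difference, using one of the headlines from the question (the variable names are only for illustration):

```php
<?php
$headline = 'BBC News - John Kerry to travel to Pakistan amid strained ties';

// Flags in the fourth position, limit -1 (no limit): the headline is split.
$pieces = preg_split('/\s*[|:-]\s*/', $headline, -1, PREG_SPLIT_NO_EMPTY);

// Flag passed in the limit slot: it is treated as limit = 1,
// so the whole headline is returned as a single piece.
$unsplit = preg_split('/\s*[|:-]\s*/', $headline, PREG_SPLIT_NO_EMPTY);

var_dump(count($pieces));  // int(2)
var_dump(count($unsplit)); // int(1)

// Reduce to the longest piece, starting from an empty string.
$result = array_reduce(
    $pieces,
    function ($carry, $piece) {
        return strlen($piece) > strlen($carry) ? $piece : $carry;
    },
    ''
);
echo $result, "\n"; // prints: John Kerry to travel to Pakistan amid strained ties
```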