Edit: I ended up using CakePHP's truncate() function. It's much faster and supports Unicode :D
But the question still remains:
How can I make the function auto-detect full stops (.) and cut the text just after one? So basically $length would be semi-ignored. If the new text would end in an incomplete sentence, more words would be appended until the sentence finishes (or removed, depending on the distance from the cut-off to the next or previous sentence end).
Edit 2: I found out how to detect full stops. I replaced:
if (!$exact) {
$spacepos = mb_strrpos($truncate, ' ');
...
with
if (!$exact) {
$spacepos = mb_strrpos($truncate, '.');
...
Edit - problem:
When I have tags like img that have dots inside their attributes, the text gets cut off inside the tag:
$text = '<p>Abc def abc def abc def abc def. Abc def <img src="test.jpg" /></p><p>abc def abc def abc def abc def.</p>';
echo htmlentities(truncate($text));
How can I fix that? I'll open a bounty because the original question was already answered...
This snippet solves what you're looking for and lists its failure cases (a full stop may not indicate the end of a sentence, and other punctuation can end sentences).
It scans characters up to $maxLen and then effectively throws away the partial sentence after the last full stop it finds.
In your case, you'd use this function just before you return $new_text.
To resolve the "full-stop in tag" issue, you can use something similar to the following to detect if the stop is within a tag:
$str_len = strlen($summary);
$pos_stop = strrpos($summary, '.');
$pos_tag_open = strrpos($summary, '<', -($str_len - $pos_stop));
$pos_tag_close = strpos($summary, '>', $pos_tag_open);
if (($pos_tag_open < $pos_stop) && ($pos_stop < $pos_tag_close)) {
// Inside tag! Search for the next nearest prior full-stop.
$pos_stop = strrpos($summary, '.', -($str_len - $pos_tag_open));
}
echo htmlentities(substr($summary, 0, $pos_stop + 1));
Obviously, this code can be optimized (and pulled out into its own function), but you get the idea. I have a feeling there is a regex that might handle this a bit more efficiently.
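A minimal sketch of the same check pulled out into its own function. The name lastStopOutsideTag is hypothetical (it is not part of CakePHP), and the sketch assumes tags are delimited by plain < and > characters, so a > inside an attribute value would confuse it:

```php
<?php
// Hypothetical helper: returns the byte position of the last full stop
// in $html that is not inside a tag, or false if there is none.
function lastStopOutsideTag(string $html)
{
    $searchEnd = strlen($html); // exclusive upper bound of the search window
    while ($searchEnd > 0) {
        $posStop = strrpos(substr($html, 0, $searchEnd), '.');
        if ($posStop === false) {
            return false; // no full stop left in the window
        }
        // Nearest "<" before the stop; if there is none, the stop is plain text.
        $posOpen = strrpos(substr($html, 0, $posStop), '<');
        if ($posOpen === false) {
            return $posStop;
        }
        // If that tag closes only after the stop (or never), the stop is inside it.
        $posClose = strpos($html, '>', $posOpen);
        if ($posClose === false || $posClose > $posStop) {
            $searchEnd = $posOpen; // retry with the text before the tag
            continue;
        }
        return $posStop;
    }
    return false;
}

$text = '<p>Abc def abc def abc def abc def. Abc def <img src="test.jpg" /></p>';
$pos = lastStopOutsideTag($text);
echo htmlentities(substr($text, 0, $pos + 1)), "\n";
```

Unlike the inline version, this keeps looking further back when the previous full stop is also inside a tag.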
Edit:
Indeed, there is a regex that can do this, using negative lookahead:
$text = '<p>Abc def abc def abc def abc def. Abc def <img src="test.jpg" />abc</p>';
$count = preg_match_all("/\.(?!([^<]+)?>)/", $text, $arr, PREG_OFFSET_CAPTURE);
$offset = $arr[0][$count-1][1];
echo substr($text, 0, $offset + 1)."\n";
This should be relatively efficient, at least in comparison with truncate(), which also uses preg_match internally.
The regular expression in the answer above (to "Truncate HTML text while taking into consideration full stops", for CakePHP's TextHelper->truncate) might work.
For efficiency, though, we could first truncate the string to a maximum length and then run the regular expression on the truncated string. And yes, other punctuation symbols must be taken into account as well.
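A rough sketch of that order of operations, under a few assumptions: cutAtLastStop and $maxLen are hypothetical names, the regex is a simplified form of the one in the answer above, and a tag that the length cut leaves half open is dropped before matching (otherwise a dot inside it would look like plain text):

```php
<?php
// Hypothetical helper, not part of CakePHP's TextHelper.
function cutAtLastStop(string $html, int $maxLen): string
{
    // 1. Cut to the maximum length first, so the regex scans less text.
    //    mb_substr counts characters; fall back to bytes without mbstring.
    $head = function_exists('mb_substr')
        ? mb_substr($html, 0, $maxLen)
        : substr($html, 0, $maxLen);

    // 2. Drop a tag that the cut left unterminated, e.g. '<img src="test.j'.
    $head = preg_replace('/<[^>]*$/', '', $head);

    // 3. Find full stops that are not followed by ">" before the next "<",
    //    i.e. full stops that are not inside a tag.
    $count = preg_match_all('/\.(?![^<]*>)/', $head, $m, PREG_OFFSET_CAPTURE);
    if (!$count) {
        return $head; // no usable full stop: keep the plain length cut
    }

    // Offsets from PREG_OFFSET_CAPTURE are in bytes, so use substr here.
    $offset = $m[0][$count - 1][1];
    return substr($head, 0, $offset + 1);
}

$text = '<p>Abc def abc def abc def abc def. Abc def <img src="test.jpg" />abc</p>';
echo htmlentities(cutAtLastStop($text, 60)), "\n";
```

The result can still contain unclosed tags such as the leading p element; closing them is left to whatever truncate() already does for that.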
A few more rules would help determine whether a punctuation symbol really ends a sentence:
- a space or an end of line follows the punctuation symbol
- the first word after the punctuation symbol starts with an uppercase letter
- multiple newlines (end of paragraph) follow the punctuation symbol, etc.
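The rules above could be sketched roughly as follows. The function name isSentenceEnd and the choice of ., ! and ? as sentence punctuation are assumptions for illustration; abbreviations followed by a capitalised word would still be misdetected:

```php
<?php
// Hypothetical helper: decides whether the character at byte offset $pos
// plausibly ends a sentence, using the heuristics listed above.
function isSentenceEnd(string $text, int $pos): bool
{
    // The character itself must be sentence punctuation.
    if (!isset($text[$pos]) || strpos('.!?', $text[$pos]) === false) {
        return false;
    }
    $rest = substr($text, $pos + 1);

    // Punctuation at the very end of the text ends a sentence.
    if ($rest === '') {
        return true;
    }
    // Two or more newlines (end of paragraph) after the punctuation.
    if (preg_match('/^\s*\n\s*\n/', $rest) === 1) {
        return true;
    }
    // Whitespace (space or EOL) followed by a word starting in upper case.
    return preg_match('/^\s+\p{Lu}/u', $rest) === 1;
}

$t = "It costs 3.50 euros. Next one.\n\nnew para e.g. this";
var_dump(isSentenceEnd($t, strpos($t, '. Next')));
```

A check like this would be applied to each candidate position found by the full-stop search before accepting it as the cut-off point.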