Truncate a multibyte String to n chars

I am trying to get this method in a String Filter working:

public function truncate($string, $chars = 50, $terminator = ' …');

I'd expect this

$in  = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890";
$out = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …";

and also this

$in  = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ";
$out = "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …";

That is, $chars minus the number of characters in the $terminator string.

In addition, the filter is supposed to cut at the first word boundary below the $chars limit, e.g.

$in  = "Answer to the Ultimate Question of Life, the Universe, and Everything.";
$out = "Answer to the Ultimate Question of Life, the …";

I am pretty certain this should work with these steps

subtract amount of chars in terminator from maximum chars
validate that string is longer than the calculated limit or return it unaltered
find the last space character in string below calculated limit to get word boundary
cut string at last space or calculated limit if no last space is found
append terminator to string
return string
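
The steps above can be sketched roughly as follows. This is only an illustration, not a tested answer from the thread: the function name truncateBySteps is hypothetical, UTF-8 input is assumed, and the encoding is passed explicitly to every mb_* call.

```php
<?php
// Hypothetical helper (not from the thread) that follows the listed steps; assumes UTF-8 input.
function truncateBySteps(string $string, int $chars = 50, string $terminator = ' …'): string
{
    $enc = 'UTF-8';

    // Step 1: subtract the terminator length from the maximum.
    $limit = max(0, $chars - mb_strlen($terminator, $enc));

    // Step 2: return the string unaltered if it is not longer than the limit.
    if (mb_strlen($string, $enc) <= $limit) {
        return $string;
    }

    // Step 3: look for the last space at or below the limit. Taking one extra
    // character lets a space sitting exactly at the limit count as a boundary.
    $head  = mb_substr($string, 0, $limit + 1, $enc);
    $space = mb_strrpos($head, ' ', 0, $enc);

    // Step 4: cut at that space, or at the limit if there is none.
    $cut = ($space === false) ? $limit : $space;

    // Steps 5 and 6: append the terminator and return.
    return mb_substr($string, 0, $cut, $enc) . $terminator;
}
```

Note that step 2, taken literally, compares against the reduced limit, so a string of 49 or 50 characters is still shortened. Comparing against $chars instead would leave such strings untouched, which may be closer to what is wanted.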

However, I have tried various combinations of str* and mb_* functions, and all of them yielded wrong results. This can't be that difficult, so I am obviously missing something. Could someone share a working implementation or point me to a resource that explains how to do it?

Thanks

P.S. Yes, I have checked https://stackoverflow.com/search?q=truncate+string+php before :)

Just found out PHP already has a multibyte truncate with

mb_strimwidth — Get truncated string with specified width

It doesn't obey word boundaries though. But handy nonetheless!
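
For reference, a minimal call might look like the sketch below. The explicit 'UTF-8' argument is my assumption about the input encoding. Keep in mind that the function measures display width rather than character count, so East Asian wide characters count as two columns.

```php
<?php
$in = "Answer to the Ultimate Question of Life, the Universe, and Everything.";

// Arguments: string, start offset, total width (trim marker included),
// trim marker, encoding.
$out = mb_strimwidth($in, 0, 50, ' …', 'UTF-8');

echo $out, "\n"; // cuts mid-word: "... Life, the Uni …"
```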

Try this:

function truncate($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

But you need to make sure that your internal encoding is properly set.
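
A short sketch of what "properly set" can mean, assuming the strings are UTF-8: either set the default once, or pass the encoding to each mb_* call so the result does not depend on configuration.

```php
<?php
// Option 1: set the default encoding used by mb_* functions once.
mb_internal_encoding('UTF-8');

// Option 2: pass the encoding explicitly on every call.
$chars = mb_strlen('âãä', 'UTF-8');      // counts characters
$wrong = mb_strlen('âãä', 'ISO-8859-1'); // wrong encoding: counts bytes instead
$bytes = strlen('âãä');                  // plain strlen() always counts bytes

echo "$chars $wrong $bytes\n";
```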

I don't usually like to just code an entire answer to a question like this. But also I just woke up, and I thought maybe your question would get me in a good mood to go program for the rest of the day.

I didn't try to run this, but it should work or at least get you 90% of the way there.

function truncate( $string, $chars = 50, $terminate = ' ...' )
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

TL;DR:

Strings that are sufficiently short should not have an ellipsis appended.
Newline characters should be qualifying breakpoints also.
Regex, once broken down and explained, is not too scary.

I think there are some important things to point out regarding this question and the current battery of answers. I'll demo a comparison of the answers plus my regex answer based on Gordon's sample data and some additional cases to expose some differing results.

First, to clarify the quality of the input values. Gordon says that the function needs to be multi-byte safe and respect word boundaries. The sample data doesn't expose the desired treatment of non-space, non-word characters (e.g. punctuation) in determining the truncation position, so we must assume that targeting whitespace characters is sufficient -- and sensibly so since most "read more" strings don't tend to worry about respecting punctuation when truncating.

Second, there are rather common cases where it is necessary to apply an ellipsis to a large body of text that contains newline characters.

Third, let's just arbitrarily agree to some basic standardizing of data such as:

Strings are already trimmed of all leading/trailing white space characters
The value of $chars will always be greater than the mb_strlen() of $terminator
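
If those two assumptions cannot be taken for granted, they could be enforced before truncating. The helper below is purely hypothetical (the name checkTruncateInput is mine) and assumes UTF-8 input.

```php
<?php
// Hypothetical guard, not part of any answer in this thread.
function checkTruncateInput(string $string, int $chars, string $terminator): string
{
    // Assumption 2: $chars must exceed the terminator length.
    if ($chars <= mb_strlen($terminator, 'UTF-8')) {
        throw new InvalidArgumentException('$chars must exceed the terminator length');
    }

    // Assumption 1: strip leading/trailing whitespace. preg_replace() returns
    // null on invalid UTF-8, in which case the input is passed through as-is.
    return preg_replace('~^\s+|\s+$~u', '', $string) ?? $string;
}
```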

(Demo)

Functions:

function truncateGumbo($string, $chars = 50, $terminator = ' …') {
    $cutPos = $chars - mb_strlen($terminator);
    $boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

function truncateGordon($string, $chars = 50, $terminator = ' …') {
    return mb_strimwidth($string, 0, $chars, $terminator);
}

function truncateSoapBox($string, $chars = 50, $terminate = ' …')
{
    $chars -= mb_strlen($terminate);
    if ( $chars <= 0 )
        return $terminate;

    $string = mb_substr($string, 0, $chars);
    $space = mb_strrpos($string, ' ');

    if ($space < mb_strlen($string) / 2)
        return $string . $terminate;
    else
        return mb_substr($string, 0, $space) . $terminate;
}

function truncateMickmackusa($string, $max = 50, $terminator = ' …') {
    $trunc = $max - mb_strlen($terminator, 'UTF-8');
    return preg_replace("~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us", $terminator, $string);
}

Test Cases:

$tests = [
    [
        'testCase' => "Answer to the Ultimate Question of Life, the Universe, and Everything.",
        // 50th char ---------------------------------------------------^
        'expected' => "Answer to the Ultimate Question of Life, the …",
    ],
    [
        'testCase' => "A single line of text to be followed by another\nline of text",
        // 50th char ----------------------------------------------------^
        'expected' => "A single line of text to be followed by another …",
    ],
    [
        'testCase' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ",
        // 50th char ---------------------------------------------------^
        'expected' => "âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …",
    ],
    [
        'testCase' => "123456789 123456789 123456789 123456789 123456789",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "123456789 123456789 123456789 123456789 123456789",
    ],
    [
        'testCase' => "Hello worldly world",
        // 50th char doesn't exist -------------------------------------^
        'expected' => "Hello worldly world",
    ],
    [
        'testCase' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890",
        // 50th char ---------------------------------------------------^
        'expected' => "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …",
    ],
];

Execution:

foreach ($tests as ['testCase' => $testCase, 'expected' => $expected]) {
    echo "\tSample Input:\t\t$testCase\n";
    echo "\n\ttruncateGumbo:\t\t" , truncateGumbo($testCase);
    echo "\n\ttruncateGordon:\t\t" , truncateGordon($testCase);
    echo "\n\ttruncateSoapBox:\t" , truncateSoapBox($testCase);
    echo "\n\ttruncateMickmackusa:\t" , truncateMickmackusa($testCase);
    echo "\n\tExpected Result:\t{$expected}";
    echo "\n-----------------------------------------------------\n";
}

Output:

    Sample Input:           Answer to the Ultimate Question of Life, the Universe, and Everything.

    truncateGumbo:          Answer to the Ultimate Question of Life, the …
    truncateGordon:         Answer to the Ultimate Question of Life, the Uni …
    truncateSoapBox:        Answer to the Ultimate Question of Life, the …
    truncateMickmackusa:    Answer to the Ultimate Question of Life, the …
    Expected Result:        Answer to the Ultimate Question of Life, the …
-----------------------------------------------------
    Sample Input:           A single line of text to be followed by another
line of text

    truncateGumbo:          A single line of text to be followed by …
    truncateGordon:         A single line of text to be followed by another
 …
    truncateSoapBox:        A single line of text to be followed by …
    truncateMickmackusa:    A single line of text to be followed by another …
    Expected Result:        A single line of text to be followed by another …
-----------------------------------------------------
    Sample Input:           âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝ

    truncateGumbo:          âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateGordon:         âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateSoapBox:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    truncateMickmackusa:    âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
    Expected Result:        âãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđ …
-----------------------------------------------------
    Sample Input:           123456789 123456789 123456789 123456789 123456789

    truncateGumbo:          123456789 123456789 123456789 123456789 12345678 …
    truncateGordon:         123456789 123456789 123456789 123456789 123456789
    truncateSoapBox:        123456789 123456789 123456789 123456789 …
    truncateMickmackusa:    123456789 123456789 123456789 123456789 123456789
    Expected Result:        123456789 123456789 123456789 123456789 123456789
-----------------------------------------------------
    Sample Input:           Hello worldly world

    truncateGumbo:          
Warning: mb_strpos(): Offset not contained in string in /in/ibFH5 on line 4
Hello worldly world …
    truncateGordon:         Hello worldly world
    truncateSoapBox:        Hello worldly …
    truncateMickmackusa:    Hello worldly world
    Expected Result:        Hello worldly world
-----------------------------------------------------
    Sample Input:           abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ1234567890

    truncateGumbo:          abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateGordon:         abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateSoapBox:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    truncateMickmackusa:    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
    Expected Result:        abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV …
-----------------------------------------------------

My pattern explanation:

Though it does look rather unsightly, most of the garbled-looking pattern syntax is just a matter of inserting the numeric values as dynamic quantifiers.

I could also have written it with concatenation as:

'~(?=.{' . $max . '})(?:\S{' . $trunc . '}|.{0,' . $trunc . '}(?=\s))\K.+~us'

For simplicity, I'll replace $trunc with 48 and $max with 50.

~                 #opening pattern delimiter
(?=.{50})         #lookahead to ensure that the string has a minimum of 50 characters
(?:               #start of non-capturing group -- to maintain pattern logic only
  \S{48}          #the string starts with at least 48 non-white-space characters
  |               #or
  .{0,48}(?=\s)   #the string starts with up to 48 characters followed by a whitespace
)                 #end of non-capturing group
\K                #restart the fullstring match (aka "forget" the previously matched characters)
.+                #match the remaining characters (these characters will be replaced)
~                 #closing pattern delimiter
us                #pattern modifiers: unicode/multibyte flag & dot matches newlines flag
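
To see what the interpolation actually produces, the pattern can be built and printed on its own. This is just a sketch using the default values from the function above.

```php
<?php
$max        = 50;
$terminator = ' …';
$trunc      = $max - mb_strlen($terminator, 'UTF-8');

// "{{$max}}" is a literal "{" followed by the interpolated value and a literal "}".
$pattern = "~(?=.{{$max}})(?:\S{{$trunc}}|.{0,$trunc}(?=\s))\K.+~us";
echo $pattern, "\n"; // ~(?=.{50})(?:\S{48}|.{0,48}(?=\s))\K.+~us

// The newline counts as whitespace, so it qualifies as a breakpoint.
$in = "A single line of text to be followed by another\nline of text";
echo preg_replace($pattern, $terminator, $in), "\n";
```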