How to iterate a UTF-8 string character by character using indexing?
When you access a UTF-8 string with the bracket operator ($str[0]), you get single bytes,
so a UTF-8 encoded character that takes 2 or more bytes is spread over 2 or more elements.
For example:
$str = "Kąt";
$str[0] = "K";
$str[1] = "�";
$str[2] = "�";
$str[3] = "t";
but I would like to have:
$str[0] = "K";
$str[1] = "ą";
$str[2] = "t";
It is possible with mb_substr, but this is extremely slow, i.e.:
mb_substr($str, 0, 1) = "K"
mb_substr($str, 1, 1) = "ą"
mb_substr($str, 2, 1) = "t"
Is there another way to iterate over the string character by character without using mb_substr?
Use preg_split. With the "u" modifier it treats the pattern and the subject as UTF-8.
$chrArray = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
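As a minimal sketch of how the resulting array could be used for indexed access: with the "u" modifier, preg_split returns false when the subject is not valid UTF-8, so it is worth checking for that before iterating. The variable names here are only illustrative.

```php
<?php
// Split a UTF-8 string into an array of characters.
$str = "Kąt";
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

// With the "u" modifier, preg_split returns false on invalid UTF-8 input.
if ($chars === false) {
    $chars = [];
}

// Each array element is now one full character, not one byte.
foreach ($chars as $i => $char) {
    echo $i, ": ", $char, PHP_EOL;
}
```

For "Kąt" this prints the indexes 0, 1 and 2 with "K", "ą" and "t".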
preg_split can fail on very large strings with a memory error, and mb_substr is indeed slow, so here is a simple and effective function that you could use instead:
function nextchar($string, &$pointer){
    // Stop when the byte offset is past the end of the string
    if(!isset($string[$pointer])) return false;
    $char = ord($string[$pointer]);
    if($char < 128){
        // Single-byte (ASCII) character
        return $string[$pointer++];
    }else{
        // The lead byte tells how many bytes the character takes
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }else{
            $bytes = 4;
        }
        $str = substr($string, $pointer, $bytes);
        $pointer += $bytes;
        return $str;
    }
}
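I used this to loop through a multibyte string character by character. If I change it to the code below, the performance difference is huge:
function nextchar($string, &$pointer){
    // Note: $pointer counts characters here, while isset() tests a byte offset,
    // so past the last character this returns empty strings until $pointer
    // reaches the byte length.
    if(!isset($string[$pointer])) return false;
    return mb_substr($string, $pointer++, 1, 'UTF-8');
}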
Looping over a string 10000 times with the code below took about 3 seconds with the first version and 13 seconds with the second:
function microtime_float(){
    list($usec, $sec) = explode(' ', microtime());
    return ((float)$usec + (float)$sec);
}
$source = 'árvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógépárvíztűrő tükörfúrógép';
$t = Array(
    0 => microtime_float()
);
for($i = 0; $i < 10000; $i++){
    $pointer = 0;
    while(($chr = nextchar($source, $pointer)) !== false){
        //echo $chr;
    }
}
$t[] = microtime_float();
echo $t[1] - $t[0].PHP_EOL.PHP_EOL;
In answer to comments posted by @Pekla and @Col. Shrapnel I have compared preg_split with mb_substr.
The image shows that preg_split took 1.2s, while mb_substr took almost 25s.
Here is the code of the functions:
function split_preg($str){
    return preg_split('//u', $str, -1);
}
function split_mb($str){
    $length = mb_strlen($str);
    $chars = array();
    for ($i=0; $i<$length; $i++){
        $chars[] = mb_substr($str, $i, 1);
    }
    $chars[] = "";
    return $chars;
}
Using Lajos Meszaros' wonderful function as inspiration I created a multi-byte string iterator class.
// Multi-Byte String iterator class
class MbStrIterator implements Iterator
{
    private $iPos = 0;
    private $iSize = 0;
    private $sStr = null;

    // Constructor
    public function __construct(/*string*/ $str)
    {
        // Save the string
        $this->sStr = $str;
        // Calculate the size of the current character
        $this->calculateSize();
    }

    // Calculate size
    private function calculateSize() {
        // If we're done already
        if(!isset($this->sStr[$this->iPos])) {
            return;
        }
        // Get the lead byte at the current position
        $iChar = ord($this->sStr[$this->iPos]);
        // If it's a single byte, set it to one
        if($iChar < 128) {
            $this->iSize = 1;
        }
        // Else, it's multi-byte
        else {
            // Figure out how long it is.
            // The 5- and 6-byte forms are not valid in current UTF-8 (RFC 3629);
            // they are handled only so that the iterator skips over them.
            if($iChar < 224) {
                $this->iSize = 2;
            } else if($iChar < 240){
                $this->iSize = 3;
            } else if($iChar < 248){
                $this->iSize = 4;
            } else if($iChar < 252){
                $this->iSize = 5;
            } else {
                $this->iSize = 6;
            }
        }
    }

    // Current
    public function current() {
        // If we're done
        if(!isset($this->sStr[$this->iPos])) {
            return false;
        }
        // Else if we have one byte
        else if($this->iSize == 1) {
            return $this->sStr[$this->iPos];
        }
        // Else, it's multi-byte
        else {
            return substr($this->sStr, $this->iPos, $this->iSize);
        }
    }

    // Key
    public function key()
    {
        // Return the current position (byte offset)
        return $this->iPos;
    }

    // Next
    public function next()
    {
        // Increment the position by the current size and then recalculate
        $this->iPos += $this->iSize;
        $this->calculateSize();
    }

    // Rewind
    public function rewind()
    {
        // Reset the position and size
        $this->iPos = 0;
        $this->calculateSize();
    }

    // Valid
    public function valid()
    {
        // Return if the current position is valid
        return isset($this->sStr[$this->iPos]);
    }
}
It can be used like so
foreach(new MbStrIterator("Kąt") as $c) {
echo "{$c}\n";
}
Which will output
K
ą
t
Or if you really want to know the position of the start byte as well
foreach(new MbStrIterator("Kąt") as $i => $c) {
echo "{$i}: {$c}\n";
}
Which will output
0: K
1: ą
3: t
You could parse each byte of the string and determine whether it is a single (ASCII) character or the start of a multi-byte character:
The UTF-8 encoding is variable-width, with each character represented by 1 to 4 bytes. Each byte has 0–4 leading consecutive '1' bits followed by a '0' bit to indicate its type. 2 or more '1' bits indicates the first byte in a sequence of that many bytes.
You would walk through the string and, instead of increasing the position by 1, read the current character in full and then increase the position by the number of bytes that character took.
The Wikipedia article has the interpretation table for each character [retrieved 2010-10-01]:
0-127 Single-byte encoding (compatible with US-ASCII)
128-191 Second, third, or fourth byte of a multi-byte sequence
192-193 Overlong encoding: start of 2-byte sequence,
but would encode a code point ≤ 127
........
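The walk described above can be sketched roughly as follows. This is only an illustration: the function names utf8SequenceLength and utf8Chars are made up for this example, the code assumes mbstring.func_overload is not enabled (so strlen and substr work on bytes), and it only inspects the lead byte without validating the continuation bytes.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the given lead byte, or 0 if the byte cannot start a sequence.
function utf8SequenceLength(int $byte): int
{
    if ($byte < 0x80) {
        return 1; // 0-127: single-byte (ASCII)
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2; // start of a 2-byte sequence
    }
    if (($byte & 0xF0) === 0xE0) {
        return 3; // start of a 3-byte sequence
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4; // start of a 4-byte sequence
    }
    // 128-191 are continuation bytes, 192-193 would be overlong encodings,
    // and 245-255 are not used in UTF-8.
    return 0;
}

// Hypothetical generator: yields byte offset => character.
function utf8Chars(string $str): Generator
{
    $pos = 0;
    $len = strlen($str);
    while ($pos < $len) {
        $size = utf8SequenceLength(ord($str[$pos]));
        if ($size === 0) {
            $size = 1; // invalid lead byte: emit it alone and move on
        }
        yield $pos => substr($str, $pos, $size);
        $pos += $size; // advance by the length of the character just read
    }
}

foreach (utf8Chars("Kąt") as $offset => $char) {
    echo $offset, ": ", $char, PHP_EOL;
}
```

Because it is a generator, the characters are produced one at a time and no array of the whole string is built.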
I had the same issue as the OP, and I try to avoid regex in PHP since it fails or even crashes with long strings. I used Mészáros Lajos' answer with some changes, since I have mbstring.func_overload set to 7.
function nextchar($string, &$pointer, &$asciiPointer){
    // $pointer counts characters, $asciiPointer counts bytes
    if(!isset($string[$asciiPointer])) return false;
    $char = ord($string[$asciiPointer]);
    if($char < 128){
        $pointer++;
        return $string[$asciiPointer++];
    }else{
        if($char < 224){
            $bytes = 2;
        }elseif($char < 240){
            $bytes = 3;
        }elseif($char < 248){
            $bytes = 4;
        }elseif($char < 252){
            $bytes = 5;
        }else{
            $bytes = 6;
        }
        // With mbstring.func_overload = 7 this substr() is really mb_substr(),
        // so it takes a character offset and returns one full character
        $str = substr($string, $pointer++, 1);
        $asciiPointer += $bytes;
        return $str;
    }
}
With mbstring.func_overload set to 7, substr actually calls mb_substr, so substr gets the right value in this case. I had to add a second pointer. One keeps track of the position in characters, the other keeps track of the position in bytes. The character position is used for substr (since it's actually mb_substr), while the byte position is used for retrieving the byte in this fashion: $string[$index]. Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so this variant only applies to older installations.
Obviously if PHP ever decides to fix the [] access to work properly with multi-byte values, this will fail. But also, this fix wouldn't be needed in the first place.
I think the most efficient solution would be to work through the string using mb_substr. In each iteration of the loop, mb_substr would be called twice (to find the next character and the remaining string). It would pass only the remaining string to the next iteration. This way, the main overhead in each iteration would be finding the next character (done twice), which takes only one to five or so operations, depending on the byte length of the character.
If this description is not clear, let me know and I'll provide a working PHP function.
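One possible reading of this description, as a rough sketch only: the function name mbChars is hypothetical, the mbstring extension is assumed to be loaded, and this is not benchmarked here. Note that each call to mb_substr for the remainder creates a new copy of the rest of the string, so the cost per iteration also depends on how much of the string is left.

```php
<?php
// Hypothetical sketch: peel off the first character, then continue
// with the remaining string, calling mb_substr twice per iteration.
function mbChars(string $str, string $encoding = 'UTF-8'): array
{
    $chars = [];
    while ($str !== '') {
        // First call: the next character
        $chars[] = mb_substr($str, 0, 1, $encoding);
        // Second call: everything after that character
        $str = mb_substr($str, 1, null, $encoding);
    }
    return $chars;
}

foreach (mbChars("Kąt") as $char) {
    echo $char, PHP_EOL;
}
```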
Since PHP 7.4 you can use mb_str_split.
https://www.php.net/manual/en/function.mb-str-split.php
$str = 'Kąt';
$chars = mb_str_split($str);
var_dump($chars);
array(3) {
[0] =>
string(1) "K"
[1] =>
string(2) "ą"
[2] =>
string(1) "t"
}
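If the code also has to run on versions older than PHP 7.4, a small wrapper could fall back to preg_split when mb_str_split is not available. This is a sketch under that assumption; the name splitChars is made up for the example.

```php
<?php
// Hypothetical wrapper: prefer mb_str_split (PHP 7.4+ with mbstring),
// otherwise fall back to preg_split with the "u" modifier.
function splitChars(string $str): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($str, 1, 'UTF-8');
    }
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false when the input is not valid UTF-8
    return $chars === false ? [] : $chars;
}

// The result can be indexed or iterated like any other array.
$chars = splitChars('Kąt');
echo $chars[1], PHP_EOL;
```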