I was wondering what the following sentence means, in simple terms for us dummies?
And what is a byte sequence? And how many characters are in a byte?
iconv_strlen() counts the occurrences of characters in the given byte sequence str on the basis of the specified character set, the result of which is not necessarily identical to the length of the string in byte.
Let's take for example the Japanese character 'こ'. Assuming UTF-8 encoding, this is a 3-byte character (0xE3 0x81 0x93). Let's see what happens when we use strlen instead:
$ php -r 'echo strlen("こ") . "\n";'
3
The result is 3, since strlen is counting bytes. However, this is only a single character according to UTF-8 encoding. That's where iconv_strlen comes in. It knows that in UTF-8, this is a single character, even though it's made up of 3 bytes. So if we try this instead:
$ php -r 'echo iconv_strlen("こ", "UTF-8") . "\n";'
1
We get 1. That's what that explanation is meant to point out.
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
A string has a particular length in bytes. The number of characters in that string will be equal to the number of bytes if and only if each character in the string is represented by a single byte. This is true, for example, for English letters in ASCII or UTF-8. For representations (i.e., encodings) that use more than one byte to represent some or all characters, the number of characters will be fewer than the number of bytes*. It is not possible, for example, to represent all possible Chinese characters with a single byte each.
So, iconv_strlen, given an encoding, will try to count the number of characters in the string. The byte sequence is simply the bytes of the string, in order. For a string containing Chinese, using UTF-8 encoding, you might, for example, have a 20-byte string that has 14 characters.
*It could be more, if a character is represented by less than one byte.
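To make the bytes-versus-characters difference concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8 and that the iconv extension is enabled; the variable names are just for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and the iconv extension is loaded.
$text = "Hello 世界"; // 6 single-byte characters + 2 Chinese characters (3 bytes each)

$bytes = strlen($text);                // strlen counts bytes
$chars = iconv_strlen($text, "UTF-8"); // iconv_strlen counts characters

echo "bytes: $bytes, characters: $chars\n"; // bytes: 12, characters: 8
```

The six ASCII characters take one byte each and the two Chinese characters take three bytes each, so the string is 12 bytes long but only 8 characters long.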
iconv_strlen() counts the occurrences of characters in the given byte sequence str on the basis of the specified character set, the result of which is not necessarily identical to the length of the string in byte.
Translations:
byte sequence: another word for string, which is a sequence of bytes (1 byte = 8 bits), e.g. 01011010 00011001 01101011. Byte sequences represent characters like A, B, C etc.
character set: a.k.a. encoding, specifies how a byte maps to a character; e.g. 01000001 represents A in the ASCII character set.
not necessarily identical to the length […] in byte: in the ASCII character set, one byte represents exactly one character. This is not the case for all character sets; in some two, three or more bytes are used to represent one character. That is because one byte can only hold 256 different values and some languages are written using more than 256 characters (like Chinese and Japanese). Unicode even attempts to map all characters of all human languages in a single character set, which requires a lot more than one byte per character.
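As a rough illustration of these definitions, the sketch below prints the bits behind A and shows the same character 'こ' stored under two different encodings. It assumes the script file is saved as UTF-8 and that the iconv extension supports UTF-16BE (common builds do); the variable names are made up for this example.

```php
<?php
// Assumption: this file is saved as UTF-8 and the iconv extension is loaded.

// "A" in ASCII is the single byte 01000001.
$bitsOfA = sprintf("%08b", ord("A"));
echo "A = $bitsOfA\n"; // A = 01000001

// The same character as two different byte sequences.
$utf8  = "こ";                               // UTF-8 bytes: e3 81 93
$utf16 = iconv("UTF-8", "UTF-16BE", $utf8);  // UTF-16BE bytes: 30 53

echo bin2hex($utf8), " (", strlen($utf8), " bytes)\n";   // e38193 (3 bytes)
echo bin2hex($utf16), " (", strlen($utf16), " bytes)\n"; // 3053 (2 bytes)

// Told the right character set, iconv_strlen sees one character both times.
$utf8Chars  = iconv_strlen($utf8, "UTF-8");
$utf16Chars = iconv_strlen($utf16, "UTF-16BE");
echo "$utf8Chars $utf16Chars\n"; // 1 1
```

The byte count changes with the encoding, but the character count stays at 1 as long as iconv_strlen is given the character set the bytes were actually written in.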
In summary:
iconv_strlen() counts the characters in the given string, taking into account the character set. Therefore, the number of characters may not be equal to the number of bytes.