Calculating the length of a Japanese multibyte string with half-width kana in PHP

Developer https://www.devze.com 2023-02-25 10:44 Source: Internet
So I have a UTF-8 encoded string which can contain full-width kanji, full-width kana, half-width kana, romaji, numbers or kawaii japanese symbols like ★ or ♥.

If I want the length I use mb_strlen() and it counts each of these as 1 in length, which is fine for most purposes.

But I've been asked (by a Japanese client) to count half-width kana as only 0.5 (for the purpose of the maxlength of a text field), because apparently that's how Japanese websites do it. I do this using mb_strwidth(), which counts full-width as 2 and half-width as 1, then I just divide by 2.

However, this method also counts romaji characters as 1, so something like Chocｱｲｽ would count as 7. Then I'd divide by 2 to account for kanji and I'd get 3.5, but I actually want 5.5 (4 for the romaji + 1.5 for the 3 half-width kana).
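To make the mismatch concrete, here is a minimal check. It assumes the mbstring extension is loaded and that the string is UTF-8 with genuinely half-width kana (ｱｲｽ); the variable names are only for illustration.

```php
<?php
$text = 'Chocｱｲｽ'; // 4 ASCII letters + 3 half-width katakana

$chars = mb_strlen($text, 'UTF-8');   // counts characters (code points)
$width = mb_strwidth($text, 'UTF-8'); // ASCII and half-width kana count 1 each

echo $chars, "\n";     // 7
echo $width / 2, "\n"; // 3.5, not the 5.5 I am after
```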

// EDIT: some more info: any character (even non-kana) which has both a full-width and a half-width form should be 1 for the full-width and 0.5 for the half-width. For example, characters like ￥、３＠（ should all be 1, but characters like ¥,3@( should all be 0.5.

// EXTRA EDIT: symbols like ☆ and ♥ should be 1, but the mb_strwidth/2 method returns them as 0.5.

Is there a standard way that Japanese systems count string length? Or does everyone just loop thru their strings and count the characters which don't match the standard width rules?


One way is to convert the half-width katakana to full-width and subtract the difference in width from the original length:

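// The kana here must be half-width: ｱｲｽ (U+FF71, U+FF72, U+FF7D)
$raw = 'Chocｱｲｽ';
// 'K' converts half-width katakana to full-width katakana
$full = mb_convert_kana($raw, 'K');
// Each converted kana gains 1 in width, so subtract half of the total gain
$len = mb_strlen($raw) - (mb_strwidth($full) - mb_strwidth($raw))/2;
assert($len === 5.5);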

However, are you sure that you should be considering basic Latin characters as full-width? There do exist full-width varieties of basic Latin characters too. That is, should Ｃｈｏｃ be considered the same as Choc?

Usually, characters like "A" and "ｱ" would have a width of 1, but "Ａ" and "ア" would have a width of 2 (which is what mb_strwidth does). I'd be cautious about having to hack around that.
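A quick way to check those four widths, assuming mbstring is available and the source file is saved as UTF-8 (the variable names are just for illustration):

```php
<?php
// Half-width Latin, half-width kana, full-width Latin, full-width kana
$samples = ['A', 'ｱ', 'Ａ', 'ア'];

$widths = [];
foreach ($samples as $sample) {
    $widths[] = mb_strwidth($sample, 'UTF-8');
}

print_r($widths); // widths in the same order as $samples: 1, 1, 2, 2
```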


Given your edit, mb_strwidth (or mb_strwidth/2) does exactly what you want.


So, I found no answer for this.

I fixed it by literally iterating thru and checking each character and manually applying the counting rules that my client asked for.
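A minimal sketch of that kind of loop, not the exact code I shipped. It assumes the rule is simply "characters in the half-width katakana range U+FF61 to U+FF9F count 0.5, everything else counts 1"; the function name clientLength is hypothetical, and a real rule set from a client may well differ.

```php
<?php
// Hypothetical helper: the counting rule below is an assumption, not a standard.
function clientLength(string $text): float
{
    // Split into individual characters; the "u" flag treats the input as UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $length = 0.0;
    foreach ($chars as $char) {
        // U+FF61..U+FF9F covers half-width katakana and half-width CJK punctuation.
        $isHalfWidthKana = preg_match('/^[\x{FF61}-\x{FF9F}]$/u', $char) === 1;
        $length += $isHalfWidthKana ? 0.5 : 1.0;
    }
    return $length;
}

echo clientLength('Chocｱｲｽ'), "\n"; // 5.5
```

Under this assumed rule, symbols such as ★ and ♥ count as 1 each, because they fall outside the half-width range.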


Look at Perl’s Unicode::GCString module: it gives the correct columns for all Unicode, including the East Asian stuff.

It is an underlying component of Unicode::LineBreak, which I have found absolutely indispensable for doing proper text segmentation of Asian scripts.

As you might well imagine, both are Made in Japan™. :)
