Non-ASCII String Character Index in C++

Developer https://www.devze.com 2022-12-12 13:39 Source: web
I'm working on a small C++ app that does a bit of string handling. Currently, I want to extract the character at a particular character index. My naive solution of using string's at() method works fine, but it breaks for non-ASCII strings. For example:

string test = "ヘ(^_^ヘ)(ノ^_^)ノ";
cout << test.at(0) << endl;

This produces a pound sign as output for me under gcc 4.2. I don't think it's a problem with my terminal either, because I can print out the entire string just fine. Is there a library or something I could use to get the desired effect?


std::string stores chars, which are only 8 bits each. You need to use std::wstring if you want to store wider characters (wchar_t is 16 bits on Windows and typically 32 bits with gcc on Linux).


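Your string is probably UTF-8, where "characters" and "bytes" are not the same thing. The std::string class treats every byte as one "character", so at(0) returns only the first byte of the multi-byte sequence that encodes ヘ, and that byte is not a printable character on its own.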
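
To make the byte/character distinction concrete, here is a minimal C++ sketch. The names utf8_length and demo_bytes_vs_characters are made up for illustration, and the code assumes the string already holds valid UTF-8. It counts characters by skipping continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is NOT a
// continuation byte (bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // this byte starts a new code point
            ++count;
        }
    }
    return count;
}

// Hypothetical demo: one character, three bytes.
void demo_bytes_vs_characters() {
    const std::string s = "\xE3\x83\x98";  // U+30D8 (ヘ) encoded as UTF-8
    assert(s.size() == 3);                 // size() reports bytes
    assert(utf8_length(s) == 1);           // but there is only one character
}
```

Under these assumptions, at(0) on such a string hands back the single byte 0xE3, which explains the stray symbol in the output.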

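One option is to convert the string to UTF-16 and use a wstring instead, where you can (generally) assume that each character is a single two-byte unit (a wchar_t or short). That assumption has limits: wchar_t is 16 bits on Windows but typically 32 bits with gcc on Linux, and characters outside the Basic Multilingual Plane take two UTF-16 units. The other option is to use a library like ICU or UTF8-CPP to operate on UTF-8 strings directly, doing things like "get the 3rd character" rather than "get the 3rd byte".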
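
As one possible illustration of the fixed-width route, the sketch below decodes UTF-8 into a std::u32string (UTF-32, available since C++11) rather than the UTF-16 wstring mentioned above, so that operator[] indexes whole code points. utf8_to_utf32 is a hypothetical helper written for this example, not a function from ICU or UTF8-CPP, and it does not reject overlong encodings or surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 so that each element is one code point.
// Hypothetical helper: checks lead and continuation bytes only.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical demo: index by character after decoding.
void demo_utf32_index() {
    const std::u32string u = utf8_to_utf32("\xE3\x83\x98(^_^");  // "ヘ(^_^"
    assert(u.size() == 5);
    assert(u[0] == char32_t(0x30D8));  // the whole character, not one byte
}
```

The trade-off is memory (four bytes per code point) and a conversion back to UTF-8 before printing; a library such as ICU or UTF8-CPP avoids hand-rolling this.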

Or, if you want to go minimalist, you could just code up a (relatively) simple function to get the byte offset and length of a particular character by reusing the internals of one of the UTF-8 string-length functions from one of the libraries listed above or found via Google. Basically you have to inspect the first byte of each character and jump ahead 1-4 bytes to get to the start of the next character, depending on which high bits are set.

Here's one that could be easily translated from PHP:

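// Count the characters in a UTF-8 string by stepping over whole sequences.
$count = 0;
for ($i = 0; $i < strlen($str); $i++) {
    $value = ord($str[$i]);           // value of the lead byte
    if ($value > 127) {
        if ($value >= 192 && $value <= 223)
            $i++;                     // 2-byte sequence: skip 1 continuation byte
        elseif ($value >= 224 && $value <= 239)
            $i = $i + 2;              // 3-byte sequence: skip 2
        elseif ($value >= 240 && $value <= 247)
            $i = $i + 3;              // 4-byte sequence: skip 3
        else
            die('Not a UTF-8 compatible string');
    }
    $count++;
}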

http://www.php.net/manual/en/function.strlen.php#25715
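
A rough C++ translation of the PHP loop might look like the following. The function names utf8_sequence_length, utf8_char_at and demo_char_at are hypothetical, the byte ranges are the ones from the PHP code, and continuation bytes are not validated, so treat it as a sketch rather than a hardened implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Length in bytes of the UTF-8 sequence whose first byte is `lead`,
// using the same ranges as the PHP loop.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead <= 127) return 1;
    if (lead >= 192 && lead <= 223) return 2;
    if (lead >= 224 && lead <= 239) return 3;
    if (lead >= 240 && lead <= 247) return 4;
    throw std::invalid_argument("Not a UTF-8 compatible string");
}

// Returns the bytes of the character at zero-based character index `index`.
std::string utf8_char_at(const std::string& s, std::size_t index) {
    std::size_t i = 0;  // byte offset of the current character
    for (std::size_t n = 0; i < s.size(); ++n) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len > s.size() - i) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        if (n == index) {
            return s.substr(i, len);  // the whole multi-byte character
        }
        i += len;  // jump to the start of the next character
    }
    throw std::out_of_range("character index out of range");
}

// Hypothetical demo using part of the string from the question.
void demo_char_at() {
    const std::string test = "\xE3\x83\x98(^_^\xE3\x83\x98)";  // "ヘ(^_^ヘ)"
    assert(utf8_char_at(test, 0) == "\xE3\x83\x98");  // all 3 bytes of ヘ
    assert(utf8_char_at(test, 1) == "(");
    assert(utf8_char_at(test, 5) == "\xE3\x83\x98");
}
```

Because the result is a std::string holding the complete byte sequence, streaming it to cout on a UTF-8 terminal should show the character rather than a lone byte.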
