I'm working on a small C++ app that does a bit of string handling. Currently, I want to extract a string at a particular character index. My naive solution of using string's at() method works fine for ASCII, but it breaks for non-ASCII strings. For example:
string test = "ヘ(^_^ヘ)(ノ^_^)ノ";
cout << test.at(0) << endl;
This produces a pound sign as output for me under gcc 4.2. I don't think it's a problem with my terminal either, because I can print out the entire string just fine. Is there a library or something I could use to get the desired effect?
string uses chars, which are only 8 bits. You need to use wstring if you want to encode 16-bit characters.
Your string is probably UTF-8, where "characters" and "bytes" are not the same thing. The std::string class treats each byte (char) as one "character", so at(0) returns only the first byte of a multi-byte character and the results are wrong.
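To make the byte/character mismatch concrete, here is a minimal sketch that counts UTF-8 code points by ignoring continuation bytes (those of the form 10xxxxxx). The name count_utf8_code_points is hypothetical, and the function assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is NOT a
// continuation byte (bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "\xE3\x83\x98(^_^" (ヘ followed by four ASCII characters), size() reports 7 bytes while this function reports 5 code points.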
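Your options are to convert the string to a wide encoding and use a wstring instead, where you can (generally) assume that each character occupies one fixed-size wchar_t (2 bytes on Windows, where it holds UTF-16; 4 bytes with gcc on Linux and Mac OS X, where it holds UTF-32), or you can use a library like ICU or UTF8-CPP to operate on UTF-8 strings directly, doing things like "get the 3rd character" rather than "get the 3rd byte". Note that with UTF-16 the fixed-width assumption fails for characters outside the Basic Multilingual Plane, which take two 16-bit units.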
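The fixed-width option can be sketched without any library. The example below assumes C++11 and swaps in std::u32string (UTF-32) in place of wstring, so that the element width does not depend on the platform's wchar_t. The helper name decode_utf8 is hypothetical, and the decoder assumes well-formed input; a real program would more likely rely on ICU or UTF8-CPP for this.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into one char32_t per code point.
// Assumption: the input is well-formed UTF-8. Continuation bytes, overlong
// forms, and stray bytes are NOT validated; a truncated trailing sequence
// is decoded from whatever bytes are present.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // number of continuation bytes that follow
        char32_t cp;        // code point being assembled
        if (lead < 0x80)      { extra = 0; cp = lead; }         // ASCII
        else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }  // 2-byte form
        else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }  // 3-byte form
        else                  { extra = 3; cp = lead & 0x07; }  // 4-byte form
        for (std::size_t k = 1; k <= extra && i + k < s.size(); ++k) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After decoding, decode_utf8(test)[n] is the n-th code point, so ordinary indexing behaves the way at() was expected to. Printing a single element still requires encoding it back to UTF-8 (or whatever the terminal expects).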
Or, if you want to go minimalist, you could just code up a (relatively) simple function to get the byte offset and length of a particular character by reusing the internals of one of the UTF-8 string-length functions from one of the libraries listed above or found via Google. Basically you have to inspect the first byte of each character and, depending on which bits are set, skip the 1-3 continuation bytes that follow it to get to the start of the next character.
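Here's one that counts the characters in a string and could be easily translated from PHP:
    $count = 0;
    for ($i = 0; $i < strlen($str); $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // Lead byte of a multi-byte sequence: skip its continuation bytes.
            if ($value >= 192 && $value <= 223)
                $i++;               // 2-byte sequence
            elseif ($value >= 224 && $value <= 239)
                $i = $i + 2;        // 3-byte sequence
            elseif ($value >= 240 && $value <= 247)
                $i = $i + 3;        // 4-byte sequence
            else
                die('Not a UTF-8 compatible string');
        }
        $count++;
    }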
http://www.php.net/manual/en/function.strlen.php#25715
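A possible C++ translation of that loop, adapted to return the character at a given index instead of a count, might look like the sketch below. The names utf8_sequence_length and utf8_char_at are hypothetical, and the byte ranges are the same ones the PHP code uses. Like the PHP version, it only inspects lead bytes and does not check that the continuation bytes are valid.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Uses the same ranges as the PHP loop; throws on a byte that cannot
// start a sequence (for example a stray continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC0 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF7) return 4;  // 11110xxx
    throw std::invalid_argument("not a valid UTF-8 lead byte");
}

// Returns the bytes of the character at `index` (0-based, counted in
// code points rather than bytes). Throws std::out_of_range if the index
// is past the end, mirroring std::string::at().
std::string utf8_char_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[pos]));
        if (pos + len > s.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        if (index == 0) {
            return s.substr(pos, len);
        }
        --index;
        pos += len;
    }
    throw std::out_of_range("character index past end of string");
}
```

With this in place, cout << utf8_char_at(test, 0) << endl; would write all the bytes of the first character rather than only its first byte, assuming test holds UTF-8. Each call walks the string from the start, so it is linear in the index. It also works on code points, so a character built from a base letter plus combining marks still comes back in pieces.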