C++ encode string to Unicode - ICU library_问答_开发者

C++ encode string to Unicode - ICU library

开发者 https://www.devze.com 2023-01-16 02:57 出处：网络

I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode.I am trying to use ICU (link text), but the following code doesn\'t work.

I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU (link text), but the following code doesn't work.

std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7";    //the first 3 chars are escape sequence to use JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false;   //couldn't find character set

UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length

// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);

This doesn't work. The result contains '?' charcters for anything I put in that was above ASCII. The status has no error. What am I doing wrong?

On top of that I was having trouble compiling the library ver 4.4 as the MSVC 9 project would not convert to MSVC 10 project.

I am also aware of libiconv open source library. I couldn't compile that one on windows. If anyone has any advice on a different library, that's also welcome.

Thanks.

EDIT The escape sequence I originally used was wrong. So now ICU takes the开发者_如何学Go string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.

EDIT2 The reason I couldn't convert to MSVC 10 project was because x64 platform wasn't installed (it isn't by default). Alternatively I could open all the projects in text editor and remove all mention of x64 target.

This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it starts with ESC. 0x1b, not 0xb0. No idea what those byte values really mean.

(This question looks familiar, Hi again.)

A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).

I couldn't get the conversion to work for JIS_X201 character set in ISO-2022-JP encoding. And I couldn't generate a "valid" one using any tools at my disposal - tried Java (ICU and non ICU implementation of ISO2022) and C++.

So I basically just wrote a function to do a code lookup and convert to Unicode using this table: wikipedia.

EDIT As I started filling out the bug report I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." link text. So it appears that the standard doesn't actually define the upper bits. The ISO-2022-JP-3 WILL map the upper bits, but to lower plane. So I have to take each byte and subtract 0x80 from it, and pass it through ISO-2022-JP-3, and take the other bytes < 128 and pass them through ISO-2022-JP converter for full JIS_X201 character set. Well it's a lot easier to just do it myself.

So strictly speaking I would say it's not a bug. It's a huge headache though.

P.S. the whole messed up stream that I'm trying to decode comes from DICOM. See pdf page 107 to see what they consider acceptable.