开发者

Storing UTF-8 string in a UnicodeString

开发者 https://www.devze.com 2022-12-28 10:33 出处:网络
In Delphi 2007 you can store a UTF-8 string in a WideString and then pass that onto a Win32 function, e.g.

In Delphi 2007 you can store a UTF-8 string in a WideString and then pass that onto a Win32 function, e.g.

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

Delphi 2007 does not interfere with the contents of UTF8Str, i.e. it is left as a UTF-8 encoded string stored in a WideString.

But in Delphi 2010 I'm struggling to find a way to do the same thing, i.e. store a UTF-8 encoded string in a WideString without it being automatically converted from UTF-8. I cannot pass a pointer to a UTF-8 string (or RawByteString), e.g. the following will obviously not work:

var
  UnicodeStr: WideString;
  UTF8Str: UTF8String;
begin
  UnicodeStr:='some unicode开发者_如何学Go text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;


Your original Delphi 2007 code was converting the UTF-8 string to a widestring using the ANSI codepage. To do the same thing in Delphi 2010 you should use SetCodePage with the Convert parameter false.

var
  UnicodeStr: UnicodeString;
  UTF8Str: RawByteString;
begin
  UTF8Str := UTF8Encode('some unicode text');
  SetCodePage(UTF8Str, 0, False);
  UnicodeStr := UTF8Str;
  Windows.SomeFunction(PWideChar(UnicodeStr), ...)


Hmm, why are you doing that? Why are you encoding a WideString to UTF-8 just to store it again back to WideString. You are obviously using a Unicode version of the Windows API. So there is no need to use a UTF-8-encoded string. Or am I missing something.

Because Windows API functions are either Unicode (two bytes) or ANSI (one byte). UTF-8 would be wrong choice here, because mainly it contains one byte per character, but for characters above the ASCII base it uses two or more bytes.

Otherwise the equivalent for your old code in unicode Delphi would be:

var
  UnicodeStr: string;
  UTF8Str: string;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

WideString and string (UnicodeString) are similar, but the new UnicodeString is faster because it is reference-counted and WideString is not.

You code was not correct because the UTF-8 string has a variable number of bytes per character. "A" is stored as one byte. Just an ASCII byte code. "ü" on the other hand would be stored as two bytes. And because you are then using PWideChar the function always expects two bytes per character.

There is another difference. In older Delphi versions (ANSI) Utf8String was just an AnsiString. In Unicode versions of Delphi Utf8String is a string with a UTF-8 code page behind it. So it behaves differently.

The old code would still work correctly:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

It would act the same as it did in Delphi 2007. So maybe you have a problem elsewhere.

Mick you are correct. The compiler does some extra work behind the scenes. So in order to avoid this you can do something like this:

var
  UTF8Str: AnsiString;
  UnicodeStr: WideString;
  TempString: RawByteString;
  ResultString: WideString;
begin
  UnicodeStr := 'some unicode text';
  TempString := UTF8Encode(UnicodeStr);
  SetLength(UTF8Str, Length(TempString));
  Move(TempString[1], UTF8Str[1], Length(UTF8Str));
  ResultString := UTF8Str;
end;

I checked, and it works just the same. Because I move bytes directly in memory there is no codepage conversion done in the background. I am sure it can be done with greater eleganece, but the point is that I see this as the way for what you want to achieve.


Which Windows API call wants you to pass a UTF-8 string? It is either an ANSI string or a Widestring (A or W functions). Widestrings have two bytes per character, and UTF-8 strings have one (or more if you beyond the first 128 ASCII characters).

UTF-8 in an Widestring just doesn't make sense. When there is really a Windows function that wants a pointer to an UTF-8 string, you probably have to cast is to a PAnsiChar.

0

精彩评论

暂无评论...
验证码 换一张
取 消