What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?
I am aware that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Character unit, but I don't get it.
I think you need to perform Unicode normalization on your string.
I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:
Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature fi becomes f + i; similarly, A + ¨ + fi + n becomes Ä + f + i + n.
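To make the difference between the composition forms concrete, here is a minimal sketch that runs the same input through NormalizationC and NormalizationKC. The helper names NormalizeTo and CodePoints are made up for this illustration, and the constant values 1 and 5 are taken from the NORM_FORM enumeration in the Windows SDK. It assumes a Unicode Delphi (2009 or later) and a Windows version that ships Normaliz.dll. Note that form KC can make the string longer (the ligature becomes two characters), so the sketch asks the API for an estimated buffer size first; a more robust version would retry when the call reports an insufficient buffer.

```delphi
program NormFormDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NormalizationC  = 1; // canonical composition
  NormalizationKC = 5; // compatibility composition

function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: normalizes s to the given form.
function NormalizeTo(Form: Integer; const s: string): string;
var
  len: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // With a destination length of 0 the API returns an estimated buffer size.
  len := NormalizeString(Form, PWideChar(s), Length(s), nil, 0);
  if len <= 0 then
    RaiseLastOSError;
  SetLength(Result, len);
  len := NormalizeString(Form, PWideChar(s), Length(s),
    PWideChar(Result), Length(Result));
  if len <= 0 then
    RaiseLastOSError;
  // Trim to the number of characters actually written.
  SetLength(Result, len);
end;

// Hypothetical helper: lists the UTF-16 code units of s for display.
function CodePoints(const s: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + 'U+' + IntToHex(Ord(s[i]), 4) + ' ';
end;

begin
  // u + combining diaeresis composes to a single precomposed character
  Writeln(CodePoints(NormalizeTo(NormalizationC, 'u'#$0308)));  // U+00FC
  // form C leaves the "fi" ligature alone ...
  Writeln(CodePoints(NormalizeTo(NormalizationC, #$FB01)));     // U+FB01
  // ... while form KC replaces it with f + i
  Writeln(CodePoints(NormalizeTo(NormalizationKC, #$FB01)));    // U+0066 U+0069
end.
```

If only the combining marks need to be merged and ligatures and other compatibility characters should stay untouched, NormalizationC is probably the better fit.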
Here is the complete code that solved my problem:
    function Unescape(const s: AnsiString): string;
    var
      i: Integer;
      j: Integer;
      c: Integer;
    begin
      // Make result at least large enough. This prevents too many reallocs
      SetLength(Result, Length(s));
      i := 1;
      j := 1;
      while i <= Length(s) do
      begin
        if s[i] = '\' then
        begin
          if i < Length(s) then
          begin
            // escaped backslash?
            if s[i + 1] = '\' then
            begin
              Result[j] := '\';
              inc(i, 2);
            end
            // convert hex number to WideChar
            else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s))
              and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then
            begin
              inc(i, 6);
              Result[j] := WideChar(c);
            end
            else
            begin
              raise Exception.CreateFmt('Invalid code at position %d', [i]);
            end;
          end
          else
          begin
            raise Exception.Create('Unexpected end of string');
          end;
        end
        else
        begin
          Result[j] := WideChar(s[i]);
          inc(i);
        end;
        inc(j);
      end;
      // Trim result in case we reserved too much space
      SetLength(Result, j - 1);
    end;

    const
      NormalizationC = 1;

    function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
      lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

    function Normalize(const s: string): string;
    var
      newLength: integer;
    begin
      // in NormalizationC mode the result string won't grow longer than the input string
      SetLength(Result, Length(s));
      newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
      SetLength(Result, newLength);
    end;

    function UnescapeAndNormalize(const s: AnsiString): string;
    begin
      Result := Normalize(Unescape(s));
    end;
Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)
Are they always escaped like this? Always in a number of 4 digits?
How is the \ character itself escaped?
Assuming the \ character is escaped by \uxxxx, where xxxx is the code for the \ character, you can easily loop through the string:
    function Unescape(s: AnsiString): WideString;
    var
      i: Integer;
      j: Integer;
      c: Integer;
    begin
      // Make result at least large enough. This prevents too many reallocs
      SetLength(Result, Length(s));
      i := 1; j := 1;
      while i <= Length(s) do
      begin
        // If a '\' is found, typecast the following 4 digit integer to widechar
        if s[i] = '\' then
        begin
          // Note: without a '$' prefix the digits are parsed as decimal
          // (see the correction at the end of this thread)
          if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
            raise Exception.CreateFmt('Invalid code at position %d', [i]);
          Inc(i, 6);
          Result[j] := WideChar(c);
        end
        else
        begin
          // Ordinary character: copy it unchanged
          Result[j] := WideChar(s[i]);
          Inc(i);
        end;
        Inc(j);
      end;
      // Trim result in case we reserved too much space
      SetLength(Result, j-1);
    end;
Use it like this:
MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);
This code was tested in Delphi 2007, but should work in XE as well due to the explicit use of AnsiString and WideString.
[edit] Code is ok. Highlighter fails.
If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them often, but they seem like a good way to parse the string and replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
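One possible answer to that, as a rough sketch rather than a tested solution: the RegularExpressions unit that ships with Delphi XE provides TRegEx and TMatch, and walking the matches by hand avoids the need for a method-pointer evaluator. The function name UnescapeWithRegEx is made up for this example. Unlike the loop-based version, this sketch silently leaves a stray backslash (one not followed by `\` or `uXXXX`) in place instead of raising an exception.

```delphi
program RegExUnescapeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, RegularExpressions;

// Hypothetical helper: replaces "\\" with "\" and "\uXXXX" with the
// UTF-16 code unit XXXX (hexadecimal). Everything else is copied as is.
function UnescapeWithRegEx(const s: string): string;
var
  m: TMatch;
  last: Integer;
  token: string;
begin
  Result := '';
  last := 1; // first character of s that has not been copied yet
  // Delphi string literals have no backslash escapes, so the regex engine
  // sees: \\(\\|u[0-9A-Fa-f]{4})
  m := TRegEx.Match(s, '\\(\\|u[0-9A-Fa-f]{4})');
  while m.Success do
  begin
    // Copy the plain text between the previous match and this one.
    Result := Result + Copy(s, last, m.Index - last);
    token := m.Groups[1].Value;
    if token = '\' then
      Result := Result + '\'
    else
      // token is 'u' followed by four hex digits; '$' marks hex for StrToInt.
      Result := Result + WideChar(StrToInt('$' + Copy(token, 2, 4)));
    last := m.Index + m.Length;
    m := m.NextMatch;
  end;
  // Copy whatever follows the last match.
  Result := Result + Copy(s, last, MaxInt);
end;

var
  s: string;
begin
  s := UnescapeWithRegEx('Fu\u0308rst');
  // 'u' is still followed by the combining diaeresis U+0308 at this point;
  // normalization is a separate step.
  Writeln(Length(s)); // 6
end.
```

The result still contains the combining mark as a separate character, so it would need to go through NormalizeString afterwards, exactly as with the loop-based Unescape.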
GolezTrol, you forgot the '$':
if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then
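A small sketch of why the prefix matters: TryStrToInt reads plain digits as decimal, and only a leading '$' makes it parse hexadecimal. It may also explain why the '\u0252berhaupt' sample appears to work without the prefix, since decimal 252 happens to be $FC, the code of ü.

```delphi
program HexPrefixDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  c: Integer;
begin
  // Without '$' the digits are decimal: 308, which is U+0134, not U+0308.
  if TryStrToInt('0308', c) then
    Writeln('0308  -> U+', IntToHex(c, 4));  // U+0134

  // With '$' the digits are hexadecimal, as a \u escape expects.
  if TryStrToInt('$0308', c) then
    Writeln('$0308 -> U+', IntToHex(c, 4));  // U+0308

  // Decimal 252 equals $FC (ü); hexadecimal $0252 is a different character.
  if TryStrToInt('0252', c) then
    Writeln('0252  -> U+', IntToHex(c, 4));  // U+00FC
  if TryStrToInt('$0252', c) then
    Writeln('$0252 -> U+', IntToHex(c, 4));  // U+0252

  // Codes containing A..F are rejected outright without the prefix.
  if not TryStrToInt('00FC', c) then
    Writeln('00FC is not a valid decimal number');
end.
```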