
C# char/byte encoding equality

Developer https://www.devze.com 2023-01-19 17:16 Source: Internet

I have some code that dumps strings to stdout so I can check their encoding. It looks like this:

    private void DumpString(string s)
    {   
        System.Console.Write("{0}: ", s);
        foreach (byte b in s)
        {   
            System.Console.Write("{0}({1}) ", (char)b, b.ToString("x2"));
        }       
        System.Console.WriteLine();
    }

Consider two strings, each of which appears as "ë" but with a different encoding. DumpString produces the following output:

ë: e(65)(08)

ë: ë(eb)

The code looks like this:

DumpString(string1);
DumpString(string2);

How can I convert string2, using System.Text.Encoding, to be byte-equivalent to string1?


They don't have different encodings. Strings in C# are always UTF-16, so you shouldn't use byte to iterate over a string: each char is truncated to its low 8 bits, which is why U+0308 shows up as 08 in your output. What they have is different normalization forms.
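To see the full code units, iterate over char rather than byte. The following is only a minimal sketch: the class name `StringDump` and the helper `ToCodeUnits` are hypothetical names introduced for illustration, and the two literals stand in for the strings from the question.

```csharp
using System;
using System.Text;

public static class StringDump
{
    // Hypothetical helper: formats every UTF-16 code unit of s as four hex digits.
    public static string ToCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)   // char, not byte: keeps all 16 bits
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToCodeUnits("e\u0308")); // 0065 0308
        Console.WriteLine(ToCodeUnits("\u00EB"));  // 00eb
    }
}
```

With all 16 bits printed, the difference between the two strings is visible directly in the code units.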

Your first string is "\u0065\u0308": LATIN SMALL LETTER E + COMBINING DIAERESIS. This is the decomposed form (NFD).

The second is "\u00EB": LATIN SMALL LETTER E WITH DIAERESIS. This is the precomposed form (NFC).

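You can convert between them with string.Normalize: pass NormalizationForm.FormD to get the decomposed form, or NormalizationForm.FormC to get the precomposed one.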
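As a rough sketch of the conversion asked about in the question, assuming string1 is the decomposed string and string2 the precomposed one: the class name `NormalizeDemo` and the helper `Utf8Hex` are hypothetical, and UTF-8 is used only as an example encoding for comparing bytes.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Hypothetical helper: UTF-8 bytes of s as dash-separated hex.
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    public static void Main()
    {
        string string1 = "e\u0308"; // decomposed (NFD), as in the question
        string string2 = "\u00EB";  // precomposed (NFC), as in the question

        // The == operator compares code units, so the raw strings differ.
        Console.WriteLine(string1 == string2); // False

        // Decompose string2 so it matches string1 code unit for code unit.
        string converted = string2.Normalize(NormalizationForm.FormD);
        Console.WriteLine(converted == string1); // True

        // Once the code units match, any encoding yields identical bytes.
        Console.WriteLine(Utf8Hex(converted)); // 65-CC-88
        Console.WriteLine(Utf8Hex(string1));   // 65-CC-88
    }
}
```

Calling Normalize() with no argument uses FormC, so going the other way (string1 to string2) needs no parameter.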


You're looking for the String.Normalize method.
