How did SourceForge maim this Unicode character?

开发者 (Developer) https://www.devze.com 2023-02-16 08:34 Source: the web

A little encoding puzzle for you.

A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.

In the XML export, however, it shows up as:

&#226;&#8364;&#8221;

Decoding the entities, that results in these code points:

U+00E2 U+20AC U+201D
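
The entity-decoding step above can be checked mechanically. The following is a minimal sketch, not part of the original question: the class name `EntityDecoder` and its two helper methods are hypothetical, and it only handles decimal numeric references of the form `&#NNN;`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal numeric character references such as &#8364;
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    /** Replaces each decimal numeric entity with the character it names. */
    public static String decode(String xmlText) {
        Matcher m = DECIMAL_ENTITY.matcher(xmlText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(xmlText, last, m.start());            // text before the entity
            out.appendCodePoint(Integer.parseInt(m.group(1))); // the entity itself
            last = m.end();
        }
        out.append(xmlText, last, xmlText.length());          // trailing text
        return out.toString();
    }

    /** Formats a string as a space-separated list of U+XXXX code points. */
    public static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&#226;&#8364;&#8221;");
        System.out.println(codePoints(decoded)); // U+00E2 U+20AC U+201D
    }
}
```

Printing code points rather than the characters themselves keeps the result readable regardless of how the terminal is configured.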

I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.

Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?

The XML output has been incorrectly encoded: the UTF-8 bytes of the character were interpreted as CP1252. To revert this, convert â€” to bytes using the CP1252 encoding and then convert those bytes back to a string using the UTF-8 encoding.

Java-based evidence:

String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —

Note that this assumes that the stdout console itself uses UTF-8 to display the character.
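
To sidestep the console-encoding caveat, the result can be verified by code point instead of by glyph. This is a small sketch under that assumption; the class name `MojibakeRepair` and the method `repair` are made up for illustration, and the input is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    /** Re-encodes the garbled text as windows-1252 bytes, then decodes those bytes as UTF-8. */
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E2 U+20AC U+201D, the three characters from the XML export
        String garbled = "\u00E2\u20AC\u201D";
        String fixed = repair(garbled);

        // Print the code point in hex instead of the glyph itself.
        System.out.println(String.format("U+%04X", fixed.codePointAt(0))); // U+2014
        System.out.println(fixed.length()); // 1
    }
}
```

Using `Charset` objects rather than charset names also avoids the checked `UnsupportedEncodingException` that `getBytes(String)` declares.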

In .NET, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.

SourceForge converted the character to UTF-8 (the bytes E2 80 94), interpreted each of those bytes as a character in CP1252, then saved the resulting characters as three separate entities using the actual Unicode code points of those characters.
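
The sequence described above can be reproduced to confirm that it yields exactly the output seen in the export. This is only a simulation of the presumed bug, not SourceForge's actual code; the class name `MojibakeDemo` and the method `corrupt` are hypothetical, and for simplicity every character is written as a numeric entity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Simulates the presumed exporter bug: UTF-8 bytes misread as windows-1252, then entity-encoded. */
    public static String corrupt(String original) {
        // Step 1: encode the text as UTF-8 (U+2014 becomes E2 80 94).
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Step 2: wrongly decode each byte as a windows-1252 character.
        String misread = new String(utf8, Charset.forName("windows-1252"));

        // Step 3: write every resulting character as a decimal numeric entity.
        StringBuilder xml = new StringBuilder();
        misread.codePoints().forEach(cp -> xml.append("&#").append(cp).append(';'));
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.println(corrupt("\u2014")); // &#226;&#8364;&#8221;
    }
}
```

Since the output matches the export byte for byte, reversing the three steps in the opposite order (decode entities, encode as CP1252, decode as UTF-8) recovers the original character.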