
XmlDocument.Load() method fails to decode € (euro)

开发者 https://www.devze.com 2023-01-29 11:06 Source: the web

I have an XML document, file.xml, which is encoded in ISO-8859-15 (aka Latin-9):

<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
  <f>€.txt</f>
</root>

From my favorite text editor, I can tell this file is correctly encoded in ISO-8859-15 (it is not UTF-8).

My software is written in C# and needs to extract the value of the element f.

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml"); 

In real life, I have an XmlResolver to set credentials, but basically my code is as simple as that. The loading goes smoothly, and no exception is raised.

The problem appears when I extract the value:

// xnsm is the XmlNamespaceManager
string filename = null;
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
if (n != null)
  filename = n.InnerText;

The Visual Studio debugger displays filename = □.txt

This could be nothing more than a Visual Studio display issue. Unfortunately, File.Exists(filename) returns false, whereas the file actually exists.

What's wrong?


If I remember correctly, the XmlDocument.Load(string) method always assumes UTF-8, regardless of the encoding declared in the XML.

You would have to create a StreamReader with the correct encoding and use that as the parameter.

// Decode the file explicitly as ISO-8859-15 instead of relying on detection.
xmlDoc.Load(new StreamReader(
                     File.OpenRead("file.xml"),
                     Encoding.GetEncoding("iso-8859-15")));

EDIT:

I just stumbled across KB308061 from Microsoft. There's an interesting passage:

Specify the encoding declaration in the XML declaration section of the XML document. For example, the following declaration indicates that the document is in UTF-16 Unicode encoding format:

<?xml version="1.0" encoding="UTF-16"?>

Note that this declaration only specifies the encoding format of an XML document and does not modify or control the actual encoding format of the data.
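Passing a reader with an explicit encoding takes the guesswork out of decoding, whatever Load(string) does with the declaration. Here is a minimal, self-contained sketch of that approach. The class and method names (Latin9XmlSketch, ReadF, SampleBytes) and the "d" prefix are made up for illustration, and the provider registration assumes .NET Core 3.0 or later, where legacy code pages such as ISO-8859-15 are opt-in.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class Latin9XmlSketch
{
    // Hypothetical helper: decodes the stream as ISO-8859-15, then returns the text of <f>.
    public static string ReadF(Stream input)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin9 = Encoding.GetEncoding("iso-8859-15");

        var doc = new XmlDocument();
        using (var reader = new StreamReader(input, latin9))
        {
            doc.Load(reader); // the reader, not the XML declaration, decides the decoding
        }

        // The sample document uses a default namespace, so XPath needs a prefix for it.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("d", "http://stackoverflow.com/demo");
        XmlNode f = doc.SelectSingleNode("/d:root/d:f", ns);
        return f?.InnerText;
    }

    // Builds the question's document in memory, with the euro sign stored as byte 0xA4.
    public static byte[] SampleBytes()
    {
        byte[] head = Encoding.ASCII.GetBytes(
            "<?xml version=\"1.0\" encoding=\"iso-8859-15\"?>" +
            "<root xmlns=\"http://stackoverflow.com/demo\"><f>");
        byte[] tail = Encoding.ASCII.GetBytes(".txt</f></root>");

        using (var ms = new MemoryStream())
        {
            ms.Write(head, 0, head.Length);
            ms.WriteByte(0xA4);
            ms.Write(tail, 0, tail.Length);
            return ms.ToArray();
        }
    }
}
```

Calling Latin9XmlSketch.ReadF(new MemoryStream(Latin9XmlSketch.SampleBytes())) returns "€.txt", with the first character being U+20AC.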


Don't just use the debugger or the console to display the string as a string.

Instead, dump the contents of the string, one character at a time. For example:

foreach (char c in filename)
{
    Console.WriteLine("{0}: {1:x4}", c, (int) c);
}

That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.

Use the Unicode code charts to look up the characters specified.
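The same idea can be pushed one level down, to the bytes in the file. The sketch below is only illustrative (CodeUnitDump, Describe and DecodeAs are hypothetical names, and the provider registration assumes .NET Core 3.0 or later). It relies on the fact that the euro sign is the single byte 0xA4 in ISO-8859-15, the single byte 0x80 in Windows-1252, and the three bytes E2 82 AC in UTF-8, so decoding with the wrong encoding yields a different character.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodeUnitDump
{
    // Formats every UTF-16 code unit of the string as U+XXXX.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    // Decodes raw bytes with the named encoding (for example "iso-8859-15").
    public static string DecodeAs(string encodingName, byte[] bytes)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

With these helpers, Describe(DecodeAs("iso-8859-15", new byte[] { 0xA4 })) gives "U+20AC" (the euro sign), while the same byte decoded as "iso-8859-1" gives "U+00A4" (the generic currency sign) and decoded as "utf-8" gives "U+FFFD" (the replacement character). A filename that starts with anything other than U+20AC will not match the file on disk.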


  1. Does your XML declare its encoding correctly? Is encoding="iso-8859-15" really the name for Iso-Latin-15?

  2. Ideally, you should put your content inside a CDATA section, so the XML would look like <f><![CDATA[€.txt]]></f>

  3. Ideally, you should also escape all special characters with their URL-encoded (percent-encoded) equivalents, because XML is typically sent over HTTP.

I don't know the exact escape code for €, but it would be something of this sort:

<f><![CDATA[%3E.txt]]></f>

The above should let € be transmitted correctly in the XML.
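For reference, the actual escape codes can be checked with a few lines of C#. This is only a sketch with made-up names (EuroEscapes, PercentEncode, PercentDecode, ParseText). Two things are worth knowing when reading it: an XML parser does not undo percent-encoding, so the consumer would have to call Uri.UnescapeDataString itself, and XML has its own encoding-independent escape, the numeric character reference &#x20AC;.

```csharp
using System;
using System.Xml;

public static class EuroEscapes
{
    // Percent-encodes the UTF-8 bytes of the string, as Uri.EscapeDataString does.
    public static string PercentEncode(string s) => Uri.EscapeDataString(s);

    // Reverses percent-encoding; an XML parser will not do this on its own.
    public static string PercentDecode(string s) => Uri.UnescapeDataString(s);

    // Parses a small XML fragment and returns the text of its root element.
    public static string ParseText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

PercentEncode("€.txt") returns "%E2%82%AC.txt", whereas PercentDecode("%3E.txt") returns ">.txt", so %3E stands for ">" rather than for the euro sign. ParseText("<f>&#x20AC;.txt</f>") returns "€.txt", while ParseText("<f><![CDATA[%3E.txt]]></f>") returns the literal text "%3E.txt".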
