XML declaration encoding

开发者 https://www.devze.com 2023-04-07 11:52 Source: web

What does it actually do? At my very basic level of understanding, XML is just formatted text, so there is no binary<->text transformation involved.

I strongly suspect that the only difference between UTF-8 and ASCII encoding is that ASCII makes the XML writer work harder: it has to convert all non-ASCII characters into XML entities, as opposed to just the reserved XML characters. So ASCII-encoded XML can still contain UTF-8 characters, except that it is going to be slightly longer and uglier.

Or is there some other function to it?

Update:

I perfectly understand how individual characters are converted into byte(s) by means of an encoding. However, XML is just text markup, and at no point does it perform that conversion itself.

The real question is: why is the encoding value stored in the XML itself? In other words, in what case would an XML reader need to know which encoding was used for a particular XML document?


See Appendix F in the XML specification, "Autodetection of Character Encodings".

In particular, the "XML encoding value is stored in the XML" because, in the absence of external metadata (information found outside the XML document itself), XML processors must by default assume the content is in UTF-8 or UTF-16. The XML declaration is designed for exactly those cases where no such metadata is present.

Another advantage of how XML handles encodings is that an XML processor need support only two encodings, namely UTF-8 and UTF-16. If the processor discovers, either in external metadata or in the XML declaration, that the document is in an encoding it does not support, it can fail sooner than it would if it continued to read the document (long after the declaration) and only then encountered an unexpected byte sequence for an encoding it had guessed using implementation-dependent heuristics.
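
To make the autodetection idea more concrete, here is a rough Python sketch of the kind of sniffing Appendix F describes: look for a byte order mark, otherwise look at how the bytes of "<?xml" are laid out, then read the declared encoding name. The function name `sniff_xml_encoding` and the 200-byte look-ahead are my own assumptions, and the sketch ignores EBCDIC and external metadata such as HTTP headers, so treat it as an illustration rather than a conforming implementation.

```python
import codecs
import re

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# How the start of "<?xml" looks in a BOM-less document.
_PREFIXES = [
    (b"\x00<\x00?\x00x\x00m", "utf-16-be"),
    (b"<\x00?\x00x\x00m\x00", "utf-16-le"),
    (b"<?xml", "ascii"),  # any ASCII-compatible encoding; enough to read the declaration
]

_DECL_ENCODING = re.compile(
    r"""<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding name of an XML document."""
    family = None
    skip = 0
    for bom, name in _BOMS:
        if data.startswith(bom):
            family, skip = name, len(bom)
            break
    else:
        # No BOM found: fall back to the byte pattern of "<?xml".
        for prefix, name in _PREFIXES:
            if data.startswith(prefix):
                family = name
                break

    if family is None:
        # No BOM and no XML declaration: the XML default is UTF-8.
        return "utf-8"

    # The declaration only uses ASCII characters, so decoding the first
    # bytes with the detected family is enough to read the encoding name.
    head = data[skip:skip + 200].decode(family, errors="replace")
    match = _DECL_ENCODING.match(head)
    if match:
        return match.group(1).lower()
    return "utf-8" if family == "ascii" else family


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```

A processor that does not support the name returned here can give up immediately, which is the "fail sooner" behaviour mentioned above.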


I'd highly, HIGHLY recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You're saying XML is "just text" as if that makes everything simple, but even knowing that it's text as opposed to some structured binary format doesn't mean you know exactly how to read it or what characters are therein.

This isn't a "go read the manual!" answer; I believe that establishing a baseline on how difficult text can be will help explain why the XML declaration exists.

why does XML declaration need encoding in the first place?

This is one of the ideas addressed in the article, but it's worth stressing here: All text has an encoding. There is no such thing as 'Plain Text'. ASCII is an encoding, even if we don't think about it most of the time. Historically we've often stuck our heads in the sand and assumed everything is ASCII, but this isn't feasible in this day and age. The XML declaration's encoding helps us out, whereas a .txt file has nothing to indicate what its encoding is.
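
A small Python illustration of "all text has an encoding": the same bytes turn into different characters depending on which encoding the reader assumes. The sample word is my own choice, not something from the question.

```python
# The same four characters, written with an escape so the source file's own
# encoding does not matter: "caf" followed by U+00E9 (e with acute accent).
text = "caf\u00e9"

raw = text.encode("utf-8")
print(raw)  # b'caf\xc3\xa9' : five bytes for four characters

# Reading the bytes back with the right encoding recovers the text.
as_utf8 = raw.decode("utf-8")

# Reading them as Latin-1 "works" but silently produces the wrong characters.
as_latin1 = raw.decode("latin-1")
print(as_utf8, as_latin1)

# Reading them as ASCII fails outright, because bytes above 0x7F are not ASCII.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

Nothing in the bytes themselves says which of these readings is the intended one, which is the gap the XML declaration fills.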


Yes, an XML file is a text file, i.e. a sequence of characters. A file is a sequence of bytes. So how are individual characters encoded, i.e. converted into a sequence of bytes? There are several ways to encode characters into bytes; the "encoding" declaration indicates which one is used.

As such, the "encoding" declaration plays a very significant role: one absolutely needs to know which encoding is used merely to be able to read the characters from the file. If no encoding is declared, XML falls back to a default that depends on whether the file starts with a "byte order mark" (BOM): a UTF-16 BOM means UTF-16, and if there is no BOM, the default encoding is UTF-8.
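
Here is a minimal sketch of what that means for a reader, using Python's standard `xml.etree.ElementTree`. The element name and sample text are made up for the example. The same Latin-1 bytes parse fine when the declaration names the encoding, and are rejected when the parser has to fall back to the UTF-8 default.

```python
import xml.etree.ElementTree as ET

# A tiny document whose text contains U+00E9, stored as ISO-8859-1 bytes.
latin1_body = "<greeting>caf\u00e9</greeting>".encode("iso-8859-1")

# With a declaration, the parser knows how to turn the bytes into characters.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
declared_text = ET.fromstring(declared).text
print(declared_text)

# Without a declaration (and without a BOM) the parser assumes UTF-8,
# and the lone 0xE9 byte is not valid UTF-8, so parsing fails.
try:
    ET.fromstring(latin1_body)
    undeclared_parsed = True
except ET.ParseError as exc:
    undeclared_parsed = False
    print("parse failed:", exc)

# The same characters stored as UTF-8 need no declaration at all.
utf8_body = "<greeting>caf\u00e9</greeting>".encode("utf-8")
utf8_text = ET.fromstring(utf8_body).text
print(utf8_text)
```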

ASCII is one of the simplest encodings. It can only represent 128 characters: the basic Latin letters, digits, punctuation and control codes. UTF-8 is more elaborate; it can represent the whole Unicode character set. So you're right: if you're using ASCII, you're obliged to use numeric character references (such as &#233;) to represent the huge number of characters that exist in Unicode but not in ASCII.
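
You can watch a writer do exactly this with Python's standard `xml.etree.ElementTree`. This is only a sketch with an invented element; other XML libraries may format their output differently.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("greeting")
elem.text = "caf\u00e9"  # U+00E9 is outside the ASCII range

# Serialised as ASCII, the non-ASCII character becomes a character reference.
ascii_bytes = ET.tostring(elem, encoding="us-ascii")
print(ascii_bytes)

# Serialised as UTF-8, the character is written directly as two bytes.
utf8_bytes = ET.tostring(elem, encoding="utf-8")
print(utf8_bytes)

# Both serialisations describe the same document once parsed.
same_text = ET.fromstring(ascii_bytes).text == ET.fromstring(utf8_bytes).text
print(same_text)
```

The ASCII version is longer and harder to read, but a parser sees the same characters in both, which matches the suspicion in the question.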
