
Is 0xF8 a valid byte in a UTF-8 encoded XML document?

Developer https://www.devze.com 2023-02-06 17:45 Source: the web

I am receiving a document that claims to be UTF-8 (<?xml version="1.0" encoding="UTF-8"?>). I've had some problems in the past where the encoding declaration from the sender has not been all that reliable (i.e. documents are declared to have a given encoding when in fact they do not), so I try to check using http://utf8checker.codeplex.com/. According to this tool, a 0xF8 byte means that this document is not UTF-8 encoded.

However, to the contrary, this page lists the Norwegian character 'ø' as being represented in UTF-8 as 0xF8. (The page is in Norwegian, however, the data I am referring to stems from the table at the bottom of the page.)

Can anyone help me sort this out? I'm feeling rather confused here.

Thanks!


ø is U+00F8 and since it is not in ASCII it cannot be a single UTF-8 code unit. It is represented by 0xC3 0xB8 in UTF-8. Therefore, if you have 0xF8 standing alone in a document somewhere, yes, it is invalid UTF-8.

It seems that the document uses either Latin-1 or the Windows code page 1252.
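
The byte sequences can be compared directly. The following is a minimal sketch in Python, chosen only for illustration since the question does not name a language. It uses nothing beyond the built-in codecs:

```python
# 'ø' is U+00F8; encode it with each of the codecs mentioned above.
char = "\u00f8"

utf8_bytes = char.encode("utf-8")      # two bytes: 0xC3 0xB8
latin1_bytes = char.encode("latin-1")  # one byte: 0xF8
cp1252_bytes = char.encode("cp1252")   # one byte: 0xF8

print(utf8_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# A lone 0xF8 decodes as Latin-1 but is rejected by the UTF-8 decoder.
lone = b"\xf8"
decodes_as_latin1 = lone.decode("latin-1") == char
try:
    lone.decode("utf-8")
    lone_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_is_valid_utf8 = False

print(decodes_as_latin1, lone_is_valid_utf8)
```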


I don't think that page is very reliable; it also says "UTF-8 = UCS-1".

Checking Wikipedia, 0xF8 could only ever be the first byte of a 5-byte sequence in the original UTF-8 design, and RFC 3629 has since limited UTF-8 to sequences of at most 4 bytes (code points up to U+10FFFF). So no, 0xF8 is never valid in UTF-8.
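
To make the byte ranges concrete, here is a rough sketch that classifies a single byte by the role it can play in UTF-8 as restricted by RFC 3629. The function name `utf8_byte_kind` is made up for this illustration, and the code only looks at one byte in isolation, so it is not a full validator:

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify one byte value (0-255) by its possible role in UTF-8."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in the range 0-255")
    if byte <= 0x7F:
        return "ascii"            # 0xxxxxxx: complete 1-byte sequence
    if byte <= 0xBF:
        return "continuation"     # 10xxxxxx: only valid after a lead byte
    if byte <= 0xC1:
        return "invalid"          # 0xC0, 0xC1: would start an overlong form
    if byte <= 0xDF:
        return "lead-2"           # 110xxxxx: starts a 2-byte sequence
    if byte <= 0xEF:
        return "lead-3"           # 1110xxxx: starts a 3-byte sequence
    if byte <= 0xF4:
        return "lead-4"           # 11110xxx, limited to U+10FFFF
    return "invalid"              # 0xF5-0xFF never occur in UTF-8


for value in (0xC3, 0xB8, 0xF8):
    print(hex(value), utf8_byte_kind(value))
```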


The utf8checker tool is right and the page you are referring to is wrong. The UTF-8 representation of 'ø' is 0xC3 0xB8 (two bytes).

http://www.fileformat.info/info/unicode/char/f8/index.htm
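
If you want to double-check an incoming document yourself, something along these lines might work. It is only a sketch: `first_invalid_utf8_offset` is a hypothetical helper name, the sample documents are invented for the demonstration, and it leans on Python's built-in strict UTF-8 decoder rather than on the utf8checker tool:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Two made-up documents with the same declaration but different actual encodings.
text = '<?xml version="1.0" encoding="UTF-8"?><a>\u00f8</a>'
really_utf8 = text.encode("utf-8")
really_latin1 = text.encode("latin-1")

print(first_invalid_utf8_offset(really_utf8))
print(first_invalid_utf8_offset(really_latin1) == really_latin1.index(b"\xf8"))
```

When the check fails, decoding the bytes as Latin-1 or Windows-1252 instead is a plausible fallback, but that is a guess about the sender's encoding, not something the bytes themselves can prove.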
