I am receiving a document that claims to be UTF-8 (the declaration reads <?xml version="1.0" encoding="UTF-8"?>). I've had problems in the past where the sender's encoding declaration has not been reliable (i.e. documents are declared to have a given encoding when in fact they do not), so I try to verify it using http://utf8checker.codeplex.com/. According to this tool, the presence of a 0xF8 byte means that this document is not UTF-8 encoded.
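For reference, here is roughly how I double-check the raw bytes myself, independent of the tool; a minimal Python sketch (the file name is just a placeholder, and it is not meant to mirror the tool's internals):

```python
# Try to decode the raw bytes as UTF-8, ignoring what the XML declaration claims.
# "document.xml" is a placeholder file name.
with open("document.xml", "rb") as f:
    data = f.read()

try:
    data.decode("utf-8")
    print("Bytes are valid UTF-8")
except UnicodeDecodeError as e:
    # e.start is the offset of the first offending byte (0xF8 in my case).
    print(f"Not valid UTF-8: byte 0x{data[e.start]:02X} at offset {e.start}")
```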
However, this page, to the contrary, lists the Norwegian character 'ø' as being represented in UTF-8 as 0xF8. (The page is in Norwegian; the data I am referring to comes from the table at the bottom of the page.)
Can anyone help me sort this out? I'm feeling rather confused here.
Thanks!
'ø' is U+00F8, and since it is outside the ASCII range it cannot be encoded as a single byte in UTF-8. It is represented by the two bytes 0xC3 0xB8. Therefore, if you have a lone 0xF8 byte standing somewhere in the document, then yes, it is invalid UTF-8.
It seems that the document actually uses either Latin-1 (ISO-8859-1) or Windows code page 1252, both of which encode 'ø' as the single byte 0xF8.
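A quick way to see this for yourself (a small Python sketch, purely illustrative):

```python
# 'ø' is U+00F8: two bytes in UTF-8, but a single 0xF8 byte in Latin-1/cp1252.
print("ø".encode("utf-8").hex())    # c3b8
print("ø".encode("latin-1").hex())  # f8
print("ø".encode("cp1252").hex())   # f8

# Decoding the lone 0xF8 byte as Latin-1 recovers the character,
# whereas decoding it as UTF-8 raises a UnicodeDecodeError.
print(bytes([0xF8]).decode("latin-1"))  # ø
```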
I don't think that page is very reliable; it also claims "UTF-8 = UCS-1", which is not correct.
Checking Wikipedia: 0xF8 could only appear as the lead byte of a five-byte UTF-8 sequence, but no Unicode characters require a five-byte encoding (and RFC 3629 restricts UTF-8 to four bytes anyway). So no, 0xF8 cannot occur in valid UTF-8.
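To spell out the bit-pattern argument, here is a small Python sketch (not a full validator) that checks 0xF8 against the valid UTF-8 lead-byte and continuation-byte patterns:

```python
# RFC 3629 lead-byte patterns: 0xxxxxxx (1 byte), 110xxxxx (2 bytes),
# 1110xxxx (3 bytes), 11110xxx (4 bytes); continuation bytes are 10xxxxxx.
# 0xF8 = 11111000 matches none of them.
b = 0xF8
print(f"0x{b:02X} = {b:08b}")                  # 0xF8 = 11111000
print("1-byte lead?  ", (b >> 7) == 0b0)       # False
print("2-byte lead?  ", (b >> 5) == 0b110)     # False
print("3-byte lead?  ", (b >> 4) == 0b1110)    # False
print("4-byte lead?  ", (b >> 3) == 0b11110)   # False
print("continuation? ", (b >> 6) == 0b10)      # False
```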
The utf8checker tool is right and the page you are referring to is wrong. The UTF-8 representation of 'ø' is 0xC3 0xB8 (two bytes).
http://www.fileformat.info/info/unicode/char/f8/index.htm