Foreign characters in website

Developer https://www.devze.com 2023-02-05 22:53 Source: Internet

I found a website that contains the string "donâ€™t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "donâ€™t"). A Google search yielded nothing (except lots of other sites that have the same problem!). Can anyone explain what's happening here?

Edit: Here's the meta tag that was used:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Would this not cause the page to be served up as Latin-1 in the HTTP header?

In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "â€™". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
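
The byte-level explanation above can be reproduced in a few lines of Python. This is only a minimal sketch, not part of the original answer: the variable names are illustrative, and `cp1252` is Python's codec name for windows-1252.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK is the "curly" apostrophe in question.
original = "don\u2019t"

# UTF-8 encodes U+2019 as the three octets E2 80 99.
utf8_bytes = original.encode("utf-8")

# Decoding those same bytes as windows-1252 produces the garbled text,
# because each octet is treated as a separate character.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong decoding step recovers the intended text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "))  # the raw octets, space-separated
print(garbled)
print(repaired)
```

The round trip in the last step works here only because every byte involved has a mapping in windows-1252; that is not guaranteed for arbitrary text.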

According to the Wikipedia article "Character encodings in HTML":

HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII two goals are worth considering: the information's integrity, and universal browser display.

I suppose the site you checked isn't implemented with this in mind.

This all has to do with encoding. Take another look at the source: is there a tag at the top specifying the charset? My guess is it'll be UTF-8, although it could be something completely different.
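
Checking the declared charset can also be done programmatically. The following is a rough sketch under stated assumptions: `declared_charset` and `META_CHARSET` are hypothetical names, the regular expression is a simplification rather than a real HTML parser, and it looks only at the markup. A charset sent in the HTTP `Content-Type` header takes precedence over a `<meta>` tag in browsers, so the header would need to be checked separately.

```python
import re

# Hypothetical helper pattern: matches both the http-equiv form
# (content="text/html; charset=...") and the short <meta charset="..."> form.
META_CHARSET = re.compile(
    r"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)

def declared_charset(html):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None

# The meta tag quoted in the question.
page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(declared_charset(page))
```

If the value found this way is not UTF-8 while the page's bytes actually are, the mismatch described in this thread is the likely cause.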

This thread explains it all: a combination of a weird UTF-8 apostrophe character (probably originating from a Word document) and a server that probably reports its encoding as something other than UTF-8, even though the page contains UTF-8 characters (and possibly even reports its own encoding correctly).