RFC 3986 states that new URI schemes should encode character data as UTF-8 before percent-encoding it. However, this does not apply to URI schemes that predate it.
Is it safe to assume that every multibyte, percent-encoded URI turns into a UTF-8 encoded string after being passed through urldecode()?
For example, suppose the contents of $_SERVER['REQUEST_URI'] are percent-encoded like this:
/b%C3%BCch/w%C3%B6rterb%C3%BCch
After I pass this string to urldecode(), I should have a multibyte string. But how do I know what encoding the string is in? In the above example, it's UTF-8, but is it safe to always assume so?
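To make the question concrete, here is a minimal sketch of what urldecode() does with the example path. It only reverses the %XX escapes and returns raw bytes; nothing in the result records which encoding those bytes are in.

```php
<?php
// Example path from the question; urldecode() only reverses the %XX escapes.
$encoded = '/b%C3%BCch/w%C3%B6rterb%C3%BCch';
$decoded = urldecode($encoded);

// Raw bytes of the first path segment (the five bytes after the leading slash).
echo bin2hex(substr($decoded, 1, 5)), "\n"; // 62c3bc6368

// strlen() counts bytes, not characters: 16 characters, 19 bytes in UTF-8.
echo strlen($decoded), "\n"; // 19
```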
If it's not safe to assume so, is there a way (other than mb_detect_encoding) to detect the encoding of the string? I've checked the request headers, and they don't seem to contain anything helpful.
Thank you for all the comments and answers! I have done some digging myself after I posted the question and would like to write it down here as a reference. Please let me know if this answer is wrong.
Skip to the end to go directly to the conclusion.
In the Jetty docs on International Characters and Character Encoding, in the section "International characters in URLs", I found these paragraphs:
Due to the lack of a standard, different browsers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of jetty (eg 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, jetty-4.1.x reverted to a default encoding of ISO-8859-1.
The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8.
The linked HTML 4.0 spec does indeed recommend that clients encode non-ASCII characters as UTF-8 before percent-encoding them, so we know this has been a W3C recommendation since HTML 4.0.
The example used on the page is this:
<A href="http://foo.org/Håkon">...</A>
While it later states that the same encoding should be applied to the fragment part, it doesn't say whether this also applies to the query string.
Typing URLs into browsers
Firefox
As Pekka already mentioned, based on this link, Firefox was still sending ISO-8859-1 encoded URIs as late as 2007. Reading the link, this seems to be the default behavior for Firefox < 3.0. I'm not sure whether this also applies to Firefox < 3.0 on Mac OS X, since the default encoding on Mac is UTF-8.
I've tested Firefox 3.6.13 in Windows XP and Firefox 6 in both Windows 7 and Mac OS X. The Mac version sends everything in UTF-8, so it's nothing to worry about.
Firefox 3.6.13 and 6 on Windows encode query strings as ISO-8859-1 by default, but when you type a character that doesn't exist in ISO-8859-1 into the query string (α, for example), Firefox 3 switches the encoding of the entire query string to UTF-8. I'm pretty sure later versions behave the same way.
In the Firefox 3.6.13 and 6 builds on Windows that I tested, the path part of the URI is always encoded as UTF-8.
If you type this URL into Firefox 3.6/6 on Windows:
http://localhost/test/ü/ä/index.php?chär=ü
The query string gets encoded as ISO-8859-1, but the 'path' part gets encoded as UTF-8:
http://localhost/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC
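The mixed result above can be reproduced on the server side. The following is only a sketch: it assumes the mbstring extension is loaded, and it assumes the query bytes are ISO-8859-1 because that is what this particular test showed. It decodes the path and the query separately and checks each against UTF-8.

```php
<?php
// The percent-encoded request from the Firefox 3.6/6 on Windows test.
$requestUri = '/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC';

// Split on the first "?" so path and query can be examined separately.
$parts = explode('?', $requestUri, 2);
$path  = rawurldecode($parts[0]);
$query = rawurldecode($parts[1] ?? '');

var_dump(mb_check_encoding($path, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($query, 'UTF-8')); // bool(false)

// Assumption: the query bytes are ISO-8859-1, as observed in this test.
echo mb_convert_encoding($query, 'UTF-8', 'ISO-8859-1'), "\n"; // chär=ü
```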
Also worth noting: according to this blog post, Firefox 3.0 converts the katakana character ア into ア before percent-encoding it. When I tried this in Firefox 3.6.13, in both the query string and the path, the katakana character was encoded as UTF-8 correctly.
Opera
Opera 10.10 on Mac encodes the query string part of the URI as ISO-8859-1, even though the default encoding on Mac OS X is UTF-8. The 'path' part gets encoded as UTF-8, just like in Firefox.
If you type the Greek letter α into the query string, it gets sent as a question mark.
Opera 11.51 on Windows XP exhibits the same behavior.
Safari
Safari 5.1 on Mac always sends everything as UTF-8. Safari 5.1 on Windows exhibits the same behavior.
Chrome
Version 13 on Windows encodes both query string and path as UTF-8. I don't have Chrome on Mac, but it seems safe to assume that Chrome always sends UTF-8, like Safari.
Internet Explorer
DISCLAIMER: I use IECollection to install multiple versions of IE on one machine, so this may not be IE's natural behavior (can anyone confirm this?).
IE 6, 7, and 8 on Windows XP encode the 'path' part of the URI as UTF-8 correctly. Umlauts and Greek letters typed into the query string do not get percent-encoded, though. The query string typed into the address bar seems to be sent as ISO-8859-1, and the Greek letter alpha 'α' in the query string gets transliterated into 'a'.
Conclusion
This is short and incomplete, and I cannot guarantee its correctness, but it seems that the most common encodings for URIs are ISO-8859-1 and UTF-8 (I have no idea which encodings are common in East Asia, and it would be too exhausting for me to try to find out).
Since it is already a recommendation from HTML 4.0, I guess it's safe to assume the 'path' part of the URI is always encoded in UTF-8. Firefox 2.0 might still be around, so you must check if the encoding is ISO-8859-1 too. If it's not UTF-8 or ISO-8859-1, most likely it's a bad request.
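A minimal sketch of that check-then-fall-back approach follows. The helper name toUtf8 is made up for illustration, and the mbstring extension is assumed. One caveat: every byte sequence is formally valid ISO-8859-1, so the fallback branch can never fail, which means this sketch cannot tell a genuine ISO-8859-1 request apart from garbage.

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8, assuming the only
 * alternative encoding worth supporting is ISO-8859-1.
 */
function toUtf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, keep as-is
    }
    // Not valid UTF-8: treat the bytes as ISO-8859-1 (an assumption).
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

echo toUtf8(rawurldecode('/b%C3%BCch')), "\n"; // /büch (was UTF-8)
echo toUtf8(rawurldecode('/b%FCch')), "\n";    // /büch (was ISO-8859-1)
```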
It's theoretically impossible to correctly detect the encoding of a string (see here, and here). You can guess, but you can get the wrong result. So don't rely on encoding detection.
Safe Multibyte Routing
The safest way is just to choose one encoding (UTF-8 is the safest bet) for your entire application. Then you have to:
- Make sure that all your strings are encoded in UTF-8 before using them to build your URI. Properly percent-encode your URI after that.
- Make sure all your URL-encoded (GET) forms send their data in the proper encoding. See this FAQ by Kore Nordmann for more information about making sure your forms send the correct encoding.
Also see this great answer from bobince.
After this, you shouldn't have any problems parsing the URI. If the encoding is not UTF-8, then it's a bad request, and you can respond with a 404 or 400 page.
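The two halves of this approach can be sketched as follows. This is an illustration under assumptions, not a complete router: the function name decodeUtf8Path is hypothetical, the source file is assumed to be saved as UTF-8, and the mbstring extension is assumed to be loaded.

```php
<?php
// Outgoing: build a path from UTF-8 segments. rawurlencode() percent-encodes
// every byte outside the unreserved set, segment by segment.
$segments = ['büch', 'wörterbüch'];
$path = '/' . implode('/', array_map('rawurlencode', $segments));
echo $path, "\n"; // /b%C3%BCch/w%C3%B6rterb%C3%BCch

/**
 * Hypothetical guard: decode the path part of a request URI,
 * or return null if the decoded bytes are not valid UTF-8.
 */
function decodeUtf8Path(string $requestUri): ?string
{
    $rawPath = rawurldecode(explode('?', $requestUri, 2)[0]);
    return mb_check_encoding($rawPath, 'UTF-8') ? $rawPath : null;
}

// Incoming: reject anything that does not decode to UTF-8.
$decoded = decodeUtf8Path($_SERVER['REQUEST_URI'] ?? '/');
if ($decoded === null) {
    http_response_code(400); // not UTF-8: treat as a bad request
    exit;
}
```

rawurlencode() is used per segment rather than on the whole path so that the "/" separators are left intact.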
Since it is unsafe to assume that anyway ("bad guys don't care"), you can use mb_check_encoding to test whether the string is UTF-8. UTF-8 has a structure that a string in another encoding is unlikely to conform to.
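A short illustration of why the probability is low but not zero (assuming the mbstring extension): a lone ISO-8859-1 "ü" byte fails the check, but a two-byte ISO-8859-1 string can happen to form a valid UTF-8 sequence.

```php
<?php
// ISO-8859-1 "ü" is the single byte 0xFC, which is never valid in UTF-8.
var_dump(mb_check_encoding("\xFC", 'UTF-8'));     // bool(false)

// 0xC3 0xBC is UTF-8 for "ü", but the same bytes read as ISO-8859-1 are "Ã¼".
// The check passes either way; it cannot tell which one the sender meant.
var_dump(mb_check_encoding("\xC3\xBC", 'UTF-8')); // bool(true)
```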
You don't know. It depends on the person/code that generated the URI.