I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.
In the sitemaps.org examples (under entity escaping), they have an example URL:
http://www.example.com/ümlat.php&q=name
Which when UTF-8 encoded ends up as (according to them):
http://www.example.com/%C3%BCmlat.php&q=name
However, when I try this with rawurlencode in PHP, I end up with:
http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname
I've sort of beaten this by using this function found on PHP.net:
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
'%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
"$", ",", "/", "?", "#", "[", "]");
$string = str_replace($entities, $replacements, rawurlencode($string));
but according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused I don't even know what's right.
What's the correct way to encode a URL for use in the sitemap?
Relevant RFC 3986
The problem is that http://www.example.com/ümlat.php&q=name is not a valid URL.
(Source: RFC 1738, which is obsolete but serves its purpose here. RFC 3986 does allow more characters, but no harm is done by escaping characters that don't need escaping.)
httpurl    = "http://" hostport [ "/" hpath [ "?" search ]]
hpath      = hsegment *[ "/" hsegment ]
hsegment   = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar      = unreserved | escape
unreserved = alpha | digit | safe | extra
safe       = "$" | "-" | "_" | "." | "+"
extra      = "!" | "*" | "'" | "(" | ")" | ","
escape     = "%" hex hex
search     = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
So any character must be escaped unless it is one of ;:@&=$-_.+!*'(), or an alphanumeric (0-9a-zA-Z), or part of an escape sequence (e.g. %A0 or, equivalently, %a0). The ? character can appear at most once. The / character can appear in the path portion, but not in the query string. The convention for encoding the other characters is to compute their UTF-8 representation and escape that byte sequence.
Your algorithm should (assuming the host part is not a problem...):
- extract the path part
- extract the query string part
- for each of those, look for invalid characters
- encode those characters in UTF-8
- pass the result to rawurlencode
- replace the character in the URL with the result of rawurlencode
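The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name sitemap_escape_url is invented, the input is assumed to be a UTF-8 http or https URL without a fragment, the host part is left untouched, and existing %XX sequences are treated as already escaped. The final htmlspecialchars call is an addition for the XML entity escaping that the sitemap format asks for on top of URL escaping.

```php
<?php
// Hypothetical helper: percent-encode the path and query of a UTF-8 URL, then
// entity-escape the result for use inside a sitemap <loc> element.
function sitemap_escape_url(string $url): string
{
    // Split into scheme+host, path, and query (byte-wise; fragments unsupported).
    if (!preg_match('~^(https?://[^/?#]+)(/[^?#]*)?(?:\?([^#]*))?$~', $url, $m)) {
        throw new InvalidArgumentException("Unsupported URL: $url");
    }

    // Characters allowed in an RFC 1738 hsegment / search part.
    $segmentChars = 'A-Za-z0-9;:@&=$\-_.+!*\'(),';

    $escape = function (string $part, string $allowed): string {
        // Match either an existing %XX escape or a single disallowed byte.
        $pattern = '~%[0-9A-Fa-f]{2}|[^' . $allowed . ']~';
        return preg_replace_callback($pattern, function (array $hit): string {
            // Keep existing escapes; percent-encode everything else.
            return strlen($hit[0]) === 3 ? $hit[0] : rawurlencode($hit[0]);
        }, $part);
    };

    // "/" is allowed in the path but not in the query string.
    $result = $m[1] . $escape($m[2] ?? '', $segmentChars . '/');
    if (isset($m[3])) {
        $result .= '?' . $escape($m[3], $segmentChars);
    }

    // XML entity escaping (& becomes &amp;, and so on).
    return htmlspecialchars($result, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo sitemap_escape_url("http://www.example.com/\u{FC}mlat.php&q=name"), "\n";
// http://www.example.com/%C3%BCmlat.php&amp;q=name
```

Because the pattern has no u modifier, it works byte by byte, so each byte of a multi-byte UTF-8 character is percent-encoded separately, which gives the same result as encoding the character as a whole.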