I am writing a hand-coded CSS 2.1 parsing engine (in C#), and I'm working directly off the W3C CSS 2.1 grammar (http://www.w3.org/TR/CSS21/grammar.html). However, there's a token that I just don't quite get:
url ([!#$%&*-~]|{nonascii}|{escape})*
...
"url("{w}{url}{w}")" {return URI;}
"url("{w}{string}{w}")" {return URI;}
I don't get what the URL production is supposed to do. It appears to be a string of only !#$%&*-~, non-ascii, or escaped unicode code points. How is that a URL? Is this production just really badly named, and what purpose is it supposed to serve?
Any help appreciated. FYI, I've added the C# tag only to increase the audience to actual programmers who might have encountered this or have insights - I apologize if you think I shouldn't apply.
Dude, did you read the CONTEXT surrounding that expression?
baduri1 url\({w}([!#$%&*-\[\]-~]|{nonascii}|{escape})*{w}
baduri2 url\({w}{string}{w}
baduri3 url\({w}{badstring}
Hmmm... Bad, bad, bad. Bit of a giveaway, eh what? Generally, if something in the doco doesn't make sense to you, or appears just plain wrong, maybe it shouldn't make sense? Yes? So you read around it to acquire the correct context.
[!#$%&*-~] breaks down to: !, #, $, %, &, plus the character range * - ~.
This takes in most printable ASCII characters, including uppercase, lowercase, digits and a range of punctuation characters.
It's easier to list the printable ASCII characters which this regex doesn't match: double quote ", single quote ', and the parentheses ( and ), plus the space character; i.e. printable ASCII characters minus delimiters. This makes it possible to parse URLs that do not include quotation marks, e.g. url(http://example.com) instead of url("http://example.com").
Concise, but tricky!
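To check that breakdown, here is a minimal sketch in Python (chosen only for brevity; the names URL_CHAR, printable and rejected are made up for this example). It tests every printable ASCII character against the class exactly as quoted in the question and collects the ones that fail. The space character also shows up, since it sits at 0x20, below everything the class covers.

```python
import re

# The character class exactly as quoted in the question.
URL_CHAR = re.compile(r"[!#$%&*-~]")

# Printable ASCII runs from 0x20 (space) through 0x7E (~).
printable = [chr(code) for code in range(0x20, 0x7F)]

# Keep only the characters the class does NOT accept.
rejected = [ch for ch in printable if not URL_CHAR.fullmatch(ch)]

print(rejected)  # [' ', '"', "'", '(', ')']
```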
P.S. The token name is confusing as well. A better name would have been something like: url_string or url_arg.
EDIT (Feb 2015): The latest CSS3 Syntax Spec names the token url-unquoted.
I don't get what the URL production is supposed to do. It appears to be a string of only !#$%&*-~, non-ascii, or escaped unicode code points. How is that a URL? Is this production just really badly named, and what purpose is it supposed to serve?
The first line defines url as a regular expression:
url ([!#$%&*-~]|{nonascii}|{escape})*
The second line defines URI as a token which can be produced/returned by the lexer:
"url("{w}{url}{w}")" {return URI;}
The second line says that if the lexer sees url( then {w} then {url} then {w} then ), then it has found a URI.
The {w} expression is optional whitespace.
So according to the definition, {url} is a regular expression which defines what characters are allowed inside a URI token, between the initial url( and the final ).
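As a rough illustration of how those pieces compose, here is a sketch in Python of the unquoted form of the URI rule. The patterns W, NONASCII and ESCAPE are simplified stand-ins for the grammar's {w}, {nonascii} and {escape} macros (assumptions for this example, not the spec's exact definitions; hex escapes are left out), and unquoted_uri is a hypothetical helper name.

```python
import re

# Simplified stand-ins for the grammar's macros (assumptions for this sketch).
W = r"[ \t\r\n\f]*"                # {w}: optional whitespace
NONASCII = r"[^\x00-\x7f]"         # {nonascii}: anything outside ASCII
ESCAPE = r"\\[^\r\n\f0-9a-fA-F]"   # {escape}: simple backslash escapes only

# {url}: zero or more allowed characters, using the class from the question.
URL = rf"(?:[!#$%&*-~]|{NONASCII}|{ESCAPE})*"

# "url("{w}{url}{w}")": the unquoted form of the URI token.
URI = re.compile(rf"url\({W}({URL}){W}\)")

def unquoted_uri(text):
    """Return the text between url( and ) if the whole input is an unquoted URI token, else None."""
    match = URI.fullmatch(text)
    return match.group(1) if match else None

print(unquoted_uri("url(http://example.com)"))    # http://example.com
print(unquoted_uri("url( foo.png )"))             # foo.png
print(unquoted_uri('url("http://example.com")'))  # None
```

The quoted input is rejected here because the double quote is not in the {url} character class; in the grammar that case is picked up by the second rule, the one built on {string}.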