I'm trying to match attributes from an HTML tag, but I can't get it working :)
Let's take this tag for example:
<a href="ddd" class='sw ' w'>
Obviously the last part is not quite right.
Now I tried to match the attributes part with this piece of code:
preg_match('/(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*/U', " href=\"bla\" class='sw'sw'", $a);
Here $a is empty, and that's what I expect. But if I now take my complete expression it does match the last class part, which puzzles me. It looks like this:
preg_match('/<(?P<c>[\/]?)(?P<tag>\w+)(?P<atts>(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*)\s*(?P<sc>[\/]?)>/U', $tag, $a);
Now $a returns:
Array
(
[0] => <a href="ddd" class='sw ' w'>
[c] =>
[1] =>
[tag] => a
[2] => a
[atts] => href="ddd" class='sw ' w'
[3] => href="ddd" class='sw ' w'
[4] => class='sw ' w'
[quote] => '
[5] => '
[6] => '
[sc] =>
[7] =>
)
Notice key 4, which contains the class part including the trailing ' w', even though I used the (U)ngreedy switch at the end.
Any clues?
It's really a bad idea to try to parse HTML with a regex; PHP has a DOM parser (the DOM extension's DOMDocument class) that can do this.
[^(?P=quote)]
You can't do that. Character classes only contain single characters, backslash-escapes and ranges written with -; this negated character class simply matches any character other than the literals (, ), ?, P, =, q, u, o, t and e. In particular it still matches both ' and ".
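A quick way to see this is to test the class on its own. The snippet below is only a minimal sketch; the variable names are made up for illustration and the test strings are arbitrary.

```php
<?php
// Inside [...] the text "(?P=quote)" is just a set of literal characters:
// ( ? P = q u o t e )
$class = '/^[^(?P=quote)]*$/';

// Quotes are not in the excluded set, so a value containing ' is accepted.
$acceptsQuotes = preg_match($class, "sw ' w");   // 1

// The letter q is in the excluded set, so this string is rejected.
$acceptsLetterQ = preg_match($class, "q");       // 0

var_dump($acceptsQuotes, $acceptsLetterQ);
```

So the class does nothing to stop an attribute value at its closing quote.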
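Moreover, (?P=quote) only has its special meaning outside a character class. There it is PCRE's named backreference (the recursive subpattern call is spelled (?P>quote) instead): it matches the text actually captured by the earlier group
(?P<quote>(\'|\"))
so the closing (?P=quote) in your pattern does require the same quote that opened the attribute value. Numbered backreferences such as \1 do the same for a numbered () match group. The trouble is the class in front of it: since it also accepts ' and ", the value can run past the intended closing quote to a later one. The U switch only makes the engine try the shortest match first; when the rest of the pattern then fails at the stray w, it backtracks and extends the value, which is how key 4 ends up as class='sw ' w'.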
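If you still want a regex for simple, well-formed input, one common way to express "anything up to the same quote" is a numbered backreference inside a negative lookahead rather than a character class. The following is a sketch under that assumption, not a general HTML attribute matcher; the pattern, variable names and sample string are illustrative only.

```php
<?php
// Group 1: attribute name. Group 2: the opening quote (' or ").
// (?:(?!\2).)* consumes characters one at a time as long as the next
// character is not the quote captured by group 2; the final \2 closes it.
$attr = '/\s+(\w+)=(["\'])((?:(?!\2).)*)\2/';

$input = " href=\"bla\" title=\"it's\" class='sw'";

$attrs = [];
preg_match_all($attr, $input, $matches, PREG_SET_ORDER);
foreach ($matches as $set) {
    $attrs[$set[1]] = $set[3];   // name => value without the quotes
}

print_r($attrs);
```

Here the apostrophe inside the double-quoted title is kept as part of the value, because only the quote that opened the value can end it.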
But anyway, squeeks is right: parsing [X][HT]ML with regex is a total losing game. You will never come up with an expression that treats all possible markup correctly. Stop wasting your time and use an XML or HTML parser.
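For comparison, here is a minimal sketch of reading attributes with PHP's DOM extension, assuming it is enabled (it is in a default build). The sample markup is a cleaned-up version of the tag from the question, and the variable names are illustrative.

```php
<?php
$html = '<a href="ddd" class="sw">link</a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$attrs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    foreach ($a->attributes as $attribute) {
        $attrs[$attribute->name] = $attribute->value;   // DOMAttr name and value
    }
}

print_r($attrs);
```

The parser takes care of quoting and entity rules, so no pattern has to anticipate them.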