I want to match the following link structures with preg_match_all in PHP:
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
I can get the " and ' delimited URLs by doing:
'#<a[^>]*?href=("|\')(.*?)("|\')#is'
or I can get all three forms, but not when the first two contain spaces, with:
'#<a[^>]*?href=("|\')?(.*?)[\s\"\'>]#is'
How can I formulate this so that it picks up " and ' delimited URLs that may contain spaces, as well as properly encoded URLs without delimiters?
OK, this seems to work:
'#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is'
($matches[1] contains the URLs.)
The only annoyance is that quoted URLs still carry their quotes, so you'll have to strip them off:
$first = substr($match, 0, 1);
if ($first == '"' || $first == "'") {
    $match = substr($match, 1, -1);
}
EDIT: I have edited this to work a little better than I originally posted.
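For anyone who wants to try this end to end, here is a minimal sketch that runs the regex above against the four sample tags from the question and strips the quotes afterwards. The function name extract_hrefs is my own invention for illustration, and the sketch assumes PHP 7 or later for the type declarations.

```php
<?php
// Sample input copied from the question.
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Hypothetical helper (not part of the original answer): returns the href
// values found by the regex, with surrounding quotes removed.
function extract_hrefs(string $html): array
{
    $pattern = '#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is';
    preg_match_all($pattern, $html, $matches);

    $urls = [];
    foreach ($matches[1] as $match) {
        // Quoted values still carry their delimiters; drop them.
        $first = substr($match, 0, 1);
        if ($first === '"' || $first === "'") {
            $match = substr($match, 1, -1);
        }
        $urls[] = $match;
    }
    return $urls;
}

print_r(extract_hrefs($html));
```

Keep in mind that the quoted branch stops at the first quote of either kind, so a double-quoted URL containing an apostrophe would be cut short.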
You almost have it in the second regex:
'#<a[^>]*?href=("|\')?(.*?)[\\1|>]#is'
Returns the following array:
array(3) {
[0]=>
array(4) {
[0]=>
string(92) "<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>"
[1]=>
string(101) "<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>"
[2]=>
string(94) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>"
[3]=>
string(77) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>"
}
[1]=>
array(4) {
[0]=>
string(1) """
[1]=>
string(1) "'"
[2]=>
string(0) ""
[3]=>
string(0) ""
}
[2]=>
array(4) {
[0]=>
string(74) "http://this.is.a.link.com/?query=this has invalid spaces" possible garbage"
[1]=>
string(83) "http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage"
[2]=>
string(77) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage"
[3]=>
string(60) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters"
}
}
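This matches the tag with or without delimiters. Note that inside a character class \1 is not a backreference, so group 2 runs up to the closing > and, as the dump above shows, still includes the closing quote and any trailing attributes.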
Use a DOM parser. You cannot parse (x)HTML with regular expressions.
$html = <<<END
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($html);
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName("a");
foreach ($items as $item) {
    var_dump($item->getAttribute("href"));
}
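If you want the URLs as an array rather than dumped, a variation on the same idea is to query the document with DOMXPath, which also skips anchors that have no href at all. This is only a sketch: the function name dom_hrefs is hypothetical, and it assumes the DOM extension is enabled and PHP 7 or later.

```php
<?php
// Sample input copied from the question.
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Hypothetical helper: collects the href of every <a> element that has one.
function dom_hrefs(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $urls[] = $anchor->getAttribute('href');
    }
    return $urls;
}

$urls = dom_hrefs($html);
var_dump($urls);
```

The parser takes care of quoted and unquoted attribute values, so no quote stripping is needed afterwards.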
When you say you want to match them, are you trying to extract information from the links, or simply find hyperlinks that have an href attribute? If you're after only the latter, this should work just fine:
/<a[^>]*href=[^\s].*?>/
As @JasonWoof indicated, you need to use an embedded alternation: one alternative for quoted URLs, one for non-quoted. I also recommend using a capturing group to determine which kind of quote is being used, as @DanHorrigan did. With the addition of a negative lookahead ((?!\\2)) and possessive quantifiers (*+), you can create a highly robust regex that is also very quick:
~
<a\\s+[^>]*?\\bhref=
(
(["']) # capture the opening quote
(?:(?!\\2).)*+ # anything else, zero or more times
\\2 # match the closing quote
|
[^\\s>]*+ # anything but whitespace or closing brackets
)
~ix
See it in action on ideone. (The doubled backslashes are because the regex is written in the form of a PHP heredoc. I'd prefer to use a nowdoc, but ideone is apparently still running PHP 5.2.)
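For reference, here is a sketch of the same regex written as a nowdoc (PHP 5.3 or later), which lets you use single backslashes. It is run against the sample tags from the question; the variable names and the quote-stripping loop are my own additions for illustration.

```php
<?php
// Sample input copied from the question.
$html = <<<'END'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

// Nowdoc: no escaping is applied, so backslashes are written once.
$pattern = <<<'REGEX'
~
<a\s+[^>]*?\bhref=
(
  (["'])            # capture the opening quote
  (?:(?!\2).)*+     # anything but that quote, zero or more times
  \2                # match the closing quote
|
  [^\s>]*+          # anything but whitespace or closing brackets
)
~ix
REGEX;

preg_match_all($pattern, $html, $matches);

$urls = [];
foreach ($matches[1] as $i => $value) {
    // Group 2 holds the quote character, or '' when the value was unquoted.
    $quote = $matches[2][$i];
    $urls[] = ($quote === '') ? $value : substr($value, 1, -1);
}

print_r($urls);
```

Because group 2 records which quote opened the value, a double-quoted URL may contain an apostrophe (and vice versa) without ending the match early.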