
Regexp matching mismatched html

Developer (https://www.devze.com), 2023-02-27 07:05, source: Internet
How do I parse a certain link style out of HTML without the match spreading across multiple links?

The exact link I am trying to match is:

href="http://www.hotmail.com' rel='external nofollow"

Pay particular attention to the mismatching of ' and " in the above.

What I have tried:

if(preg_match('|href="http(.*?)\' rel=\'(.*?)"|i', $html)){
  echo "Found bad html\n";
}

However, that regexp also matches in perfectly good HTML, where the match spans several links. I need it to match only within a single link.


You might be able to adapt your regex by replacing the generic .*? with a negated character class like [^<"'>]+. Because that class cannot cross a quote or an angle bracket, it usually keeps the match from eating up too much.

if(preg_match('| href="(http[^<"\'>]+)\' rel=\'([^<"\'>]+)"|i', $html)){
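To see the difference, here is a minimal sketch that runs both patterns over the same input. The sample $html string (one well-formed link followed by the broken one) and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample: a well-formed link followed by one with mismatched quotes.
$html = '<a href="http://example.com" rel="nofollow">ok</a> '
      . '<a href="http://www.hotmail.com\' rel=\'external nofollow">bad</a>';

$lazy    = '|href="http(.*?)\' rel=\'(.*?)"|i';
$negated = '| href="(http[^<"\'>]+)\' rel=\'([^<"\'>]+)"|i';

preg_match($lazy, $html, $m1);
preg_match($negated, $html, $m2);

// The lazy pattern starts at the first href=" and runs on into the second link.
echo $m1[0], "\n";
// The negated class cannot cross a quote or angle bracket, so it stays inside one tag.
echo $m2[0], "\n";
```

With this input the lazy match begins in the first, perfectly valid link, while the negated-class match is confined to the broken one.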

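Better yet: don't hard-code the " and ', but use a character class to match them too. Note that this version accepts any combination of quotes, so it also matches links whose quotes are correctly paired: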

if(preg_match('| href=["\']http([^<"\'>]+)["\']'
              .' rel=["\']([^<"\'>]*)["\']|i', $html)){

(Oh, now it looks really ugly.)
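If the goal is still to flag only the mismatched links, one option is to capture each quote character and compare the opening and closing ones afterwards. This is only a sketch under that assumption; the function name hasMismatchedQuotes and the sample inputs are hypothetical, and it covers only the href-followed-by-rel layout from the question.

```php
<?php
// Hypothetical helper: returns true if any href/rel pair in $html opens
// an attribute value with one quote character and closes it with the other.
function hasMismatchedQuotes(string $html): bool
{
    $pattern = '| href=(["\'])http[^<"\'>]+(["\'])'
             . ' rel=(["\'])[^<"\'>]*(["\'])|i';
    if (!preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        return false;
    }
    foreach ($matches as $m) {
        // $m[1]/$m[2] delimit the href value, $m[3]/$m[4] the rel value.
        if ($m[1] !== $m[2] || $m[3] !== $m[4]) {
            return true;
        }
    }
    return false;
}

var_dump(hasMismatchedQuotes('<a href="http://example.com" rel="nofollow">ok</a>'));
var_dump(hasMismatchedQuotes('<a href="http://www.hotmail.com\' rel=\'external nofollow">bad</a>'));
```

The first call reports false and the second true, because only the second link mixes ' and " within a single attribute value.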
