RegEx not working with look-aheads!

Developer https://www.devze.com 2023-01-07 23:36 Source: web
Hey guys, I am trying to match the address on this page:

http://www.bbb.org/norfolk/business-reviews/tax-return-preparation/liberty-tax-service-in-virginia-beach-va-48000604

The source for the address part has this HTML:

<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>

So I tried the following RegEx in PHP:

"%Address:</td>(.*?)(?!<br />)</td>%s"

where "s" is the modifier that lets "." match newlines too. But it is not working: it doesn't match the "Dayton, OH 45402" part. Can anyone tell me why?


Please don't try to parse HTML with regular expressions, it invokes the wrath of Zalgo.

Try using the DOM and xpath to target the specific elements and attributes you are attempting to extract.

(I'd provide an xpath example, but it's still on my to-learn list... :) )
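For readers who want to see the DOM and xpath suggestion in practice, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath classes. It assumes the rows from the question sit inside a <table> (the wrapper is added here for illustration) and that the second address line is always the row directly after the "Address:" row with an empty label cell. Neither assumption is confirmed by the original page.

```php
<?php
// Sample markup from the question; the <table> wrapper is an assumption.
$html = <<<'HTML'
<table>
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$lines = [];

// Find the row whose label cell reads "Address:".
$rows = $xpath->query(
    '//tr[td[@class="generalinfo_left"][normalize-space(.)="Address:"]]'
);

if ($rows->length > 0) {
    $row = $rows->item(0);

    // First address line: the value cell in the same row.
    $lines[] = trim($xpath->evaluate('string(td[@class="generalinfo_right"])', $row));

    // Second address line: the next row, but only if its label cell is empty.
    $next = $xpath->query(
        'following-sibling::tr[1][td[@class="generalinfo_left"][normalize-space(.)=""]]'
        . '/td[@class="generalinfo_right"]',
        $row
    );
    if ($next->length > 0) {
        $lines[] = trim($next->item(0)->textContent);
    }
}

echo implode(', ', $lines), PHP_EOL;
```

Because the queries key on the class names and the label text rather than on exact whitespace or tag order, this should tolerate small markup changes better than a regex would.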


That's expected: if you look at your sample text, there is a <br /> between Address: and Dayton, OH 45402. (?!<br />) is a negative lookahead, so it specifically states that the pattern should not match at a position where <br /> comes next.

You should use a parser for HTML.

That said, assuming that all your files are exactly like this sample, this ugly regex should work:

%(Address:)(.*?generalinfo_right">)(.*?)((<br />)|(</td>))(.*?generalinfo_right">)(.*?)((<br />)|(</td>))%s

Group 1 holds the "Address:" label; groups 3 and 8 contain the two lines of the address.
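As a rough usage sketch, the pattern above can be passed to preg_match like this. The $html string is just the sample from the question, and the variable names $street and $city are made up for illustration.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// The regex from this answer, unchanged; "%" is the delimiter, "s" lets "." match newlines.
$pattern = '%(Address:)(.*?generalinfo_right">)(.*?)((<br />)|(</td>))'
         . '(.*?generalinfo_right">)(.*?)((<br />)|(</td>))%s';

$street = null;
$city = null;

if (preg_match($pattern, $html, $m) === 1) {
    $street = trim($m[3]); // first address line
    $city   = trim($m[8]); // second address line
    echo $street, ', ', $city, PHP_EOL;
}
```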

However, since most likely your documents are not all exactly like that, a much better solution will be to parse HTML with a proper parser.


The .*? goes all the way to the end of the <br />. At that point the next text is "</td>", not "<br />", so the negative lookahead is satisfied and the match succeeds, with the capture being "<td class="generalinfo_right">1 S Main St Ste 1430<br />" (plus the leading whitespace). In other words, the lookahead doesn't prevent the match because it is checked too late.
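To see this for yourself, a small script like the following runs the pattern from the question against the sample markup and prints what group 1 actually captured. The variable names are illustrative only.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// The pattern from the question, unchanged.
$pattern = '%Address:</td>(.*?)(?!<br />)</td>%s';

$captured = null;
if (preg_match($pattern, $html, $m) === 1) {
    // Trim the newline and indentation that the lazy group picks up first.
    $captured = trim($m[1]);
    echo $captured, PHP_EOL;
    // prints: <td class="generalinfo_right">1 S Main St Ste 1430<br />
}
```

The match ends at the first </td> after "Address:</td>", so the second row is never reached.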

There are ways to write it correctly (e.g. you could explicitly add the <tr> and then <td class="generalinfo_right">). However, Charles is right that you should use a real parser.
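One possible way to spell that out, as a sketch only: name the closing </tr>, the next <tr>, its empty label cell and the second <td class="generalinfo_right"> explicitly, and make the <br /> optional. This pattern is an illustration and assumes the markup always has exactly the two-row shape shown in the question.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<tr>
    <td align="right" class="generalinfo_left">Address:</td>
    <td class="generalinfo_right">1 S Main St Ste 1430<br /></td>
</tr>
<tr>
    <td align="right" class="generalinfo_left"></td>
    <td class="generalinfo_right">Dayton, OH 45402</td>
</tr>
HTML;

// Illustrative pattern: group 1 is the first address line, group 2 the second.
// The row boundary and the empty label cell are matched explicitly.
$pattern = '%Address:</td>\s*<td class="generalinfo_right">(.*?)(?:<br />)?</td>\s*</tr>\s*'
         . '<tr>\s*<td[^>]*class="generalinfo_left"></td>\s*'
         . '<td class="generalinfo_right">(.*?)</td>%s';

$street = null;
$city = null;

if (preg_match($pattern, $html, $m) === 1) {
    $street = trim($m[1]);
    $city   = trim($m[2]);
    echo $street, ', ', $city, PHP_EOL;
}
```

Any change to attribute order, class names or nesting would break this pattern, which is the practical argument for a parser.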
