Regular expression to find text not part of a hyperlink

Developer https://www.devze.com 2022-12-22 15:31 Source: Internet

I'm trying to find a single regular expression that I can use to parse a block of HTML to find some specific text, but only if that text is not part of an existing hyperlink. I want to turn the non-links into links, which is easy, but identifying the non-linked ones with a single expression seems more troublesome. In the following example:

  This problem is a result of BugID 12.
  If you want more information, refer to <a href="/bug.aspx?id=12">BugID 12</a>.

I want a single expression to find "BugID 12" so I can link it, but I don't want to match the second one because it's already linked.

In case it matters, I'm using .NET's regular expressions.


Don't do it! See Jeff Atwood's Parsing Html The Cthulhu Way!
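For comparison, here is a minimal sketch of what the parser route could look like. It uses Python's standard `html.parser` purely for illustration (in .NET you would reach for an HTML parsing library instead), and the class name `BugLinker`, the helper `link_bugs`, and the hard-coded link target are all hypothetical. The idea is to track whether the parser is currently inside an `<a>` element and only rewrite text that sits outside one.

```python
from html.parser import HTMLParser


class BugLinker(HTMLParser):
    """Re-emits HTML, linking TARGET only in text outside <a> elements."""

    TARGET = 'BugID 12'                               # text from the question
    LINK = '<a href="/bug.aspx?id=12">BugID 12</a>'   # assumed link format

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth > 0:
            self.anchor_depth -= 1
        self.out.append(f'</{tag}>')

    def handle_data(self, data):
        # Only text outside any anchor is rewritten.
        if self.anchor_depth == 0:
            data = data.replace(self.TARGET, self.LINK)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')


def link_bugs(html):
    parser = BugLinker()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


sample = (
    'This problem is a result of BugID 12.\n'
    'If you want more information, refer to '
    '<a href="/bug.aspx?id=12">BugID 12</a>.'
)
print(link_bugs(sample))
```

This sketch drops comments and doctype declarations and does not try to preserve every detail of the input, so treat it as a starting point rather than a finished tool. Unlike a lookahead, it also leaves `<a href="..."><span>BugID 12</span></a>` alone, because the anchor depth is still non-zero inside the nested `<span>`.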


.NET supports negative lookaheads, so you can use:

(BugID 12)(?!</a>)  // match BugID 12 if it is not followed by a closing anchor tag.
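As a quick check, here is a small sketch that runs this pattern against the text from the question. It uses Python's `re` module for illustration only; the `(?!...)` lookahead syntax is the same in .NET, where the equivalent call would be `Regex.Replace` with `$1` in the replacement string instead of `\1`. The link target is copied from the question's example and is otherwise an assumption.

```python
import re

# Sample text from the question.
html = (
    'This problem is a result of BugID 12.\n'
    'If you want more information, refer to '
    '<a href="/bug.aspx?id=12">BugID 12</a>.'
)

# Match "BugID 12" only when it is not immediately followed by </a>.
pattern = re.compile(r'(BugID 12)(?!</a>)')

# Wrap each unlinked occurrence in an anchor; \1 is the captured text.
linked = pattern.sub(r'<a href="/bug.aspx?id=12">\1</a>', html)
print(linked)
```

Only the first occurrence is rewritten; the second is skipped because `</a>` follows it directly.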

However, there is still the danger that BugID 12 will be inside an anchor like

<a href="...">Something BugID 12 Something</a>

But you can mostly overcome this with

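(BugID 12)(?![\s\w]*</a>)  // [\s\w]* matches any word characters or whitespace between the string and the end tag. It accepts the same text as (?:\s*\w*)* but avoids the heavy backtracking that nested quantifiers can cause.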

Disclaimer: Parsing HTML with regex is not reliable and should only be done as a last resort, or in the simplest of cases. I'm sure there are plenty of instances where the above expression does not perform as desired (for example: BugID 12</span></a>).
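To make the limits concrete, the sketch below tries the broader lookahead on a few inputs, including the failure case from the disclaimer. It is written with Python's `re` for illustration, and it spells the in-between text as the single character class `[\s\w]*`, which accepts any run of whitespace and word characters before `</a>`. The sample strings and labels are made up for this test.

```python
import re

# Reject "BugID 12" when only whitespace/word characters separate it from </a>.
pattern = re.compile(r'(BugID 12)(?![\s\w]*</a>)')

samples = {
    'plain text': 'This problem is a result of BugID 12.',
    'whole link text': '<a href="/bug.aspx?id=12">BugID 12</a>',
    'inside longer link text': '<a href="...">Something BugID 12 Something</a>',
    'nested tag before </a>': '<a href="..."><span>BugID 12</span></a>',
}

# True means the pattern would treat the text as unlinked.
results = {label: pattern.search(text) is not None
           for label, text in samples.items()}

for label, found in results.items():
    print(f'{label}: {"match" if found else "no match"}')

# plain text: match
# whole link text: no match
# inside longer link text: no match
# nested tag before </a>: match   (false positive: already linked)
```

The last case shows why the lookahead is only a partial fix: any character outside `[\s\w]`, such as the `<` of `</span>`, stops the scan before it reaches `</a>`.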
