I'm trying to find a single regular expression that I can use to parse a block of HTML to find some specific text, but only if that text is not part of an existing hyperlink. I want to turn the non-links into links, which is easy, but identifying the non-linked ones with a single expression seems more troublesome. In the following example:
This problem is a result of BugID 12.
If you want more information, refer to <a href="/bug.aspx?id=12">BugID 12</a>.
I want a single expression to find "BugID 12" so I can link it, but I don't want to match the second one because it's already linked.
In case it matters, I'm using .NET's regular expressions.
Don't do it! See Jeff Atwood's Parsing Html The Cthulhu Way!
If .NET supports negative lookaheads (which it does):
(BugID 12)(?!</a>) // match BugID 12 if it is not followed by a closing anchor tag.
However, there is still the danger that BugID 12 will sit inside a longer anchor, like:
<a href="...">Something BugID 12 Something</a>
But you can mostly overcome this with:
(BugID 12)(?!(?:\s*\w*)*</a>) // (?:\s*\w*)* matches any word characters or spaces between the string and the end tag.
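To see how the two lookaheads behave on the inputs from the question, here is a minimal sketch. It uses Python's `re` module rather than .NET (an assumption made only so the snippet is easy to run; the lookahead syntax is the same in both). The names `SIMPLE`, `UNLINKED` and `link_bug_ids` are made up for illustration, and the href format is copied from the question's example. One deliberate substitution: the second pattern uses `[\s\w]*` in place of `(?:\s*\w*)*`. Both accept the same runs of word characters and whitespace, but the single character class avoids the nested quantifiers, which can backtrack very heavily on long runs of text that are not followed by `</a>`.

```python
import re

# First pattern from the answer: only rejects a match that is
# immediately followed by the closing tag.
SIMPLE = re.compile(r"(BugID 12)(?!</a>)")

# Second pattern, with the nested group written as one character class
# (an illustrative substitution, see the note above).
UNLINKED = re.compile(r"(BugID 12)(?![\s\w]*</a>)")


def link_bug_ids(html: str) -> str:
    """Wrap every unlinked occurrence in an anchor (href format assumed from the question)."""
    return UNLINKED.sub(r'<a href="/bug.aspx?id=12">\1</a>', html)


sample = (
    "This problem is a result of BugID 12.\n"
    "If you want more information, refer to "
    '<a href="/bug.aspx?id=12">BugID 12</a>.'
)
nested = '<a href="...">Something BugID 12 Something</a>'

print(link_bug_ids(sample))            # only the first occurrence gets wrapped
print(SIMPLE.search(nested) is None)   # False: the simple pattern still matches inside the anchor
print(link_bug_ids(nested) == nested)  # True: the longer lookahead leaves it alone
```

In .NET the equivalent call would be `Regex.Replace` with the same pattern and a `$1` backreference in the replacement string.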
Disclaimer: Parsing HTML with regex is not reliable and should only be done as a last resort, or in the simplest of cases. I'm sure there are plenty of instances where the above expression does not perform as desired. (Example: BugID 12</span></a>, where another closing tag sits between the text and the end of the anchor.)
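For cases like that one, the alternative both answers hint at is to walk the markup with a real parser and only touch text that is outside any anchor. The following is a rough sketch of that idea, not a drop-in solution: it uses Python's standard-library `html.parser` purely for illustration (in .NET you would reach for an HTML parsing library instead), and `BugLinker`, `link_unlinked`, `TARGET` and `LINK` are hypothetical names. It re-emits tags as it sees them and does not treat `<script>` or `<style>` content specially.

```python
from html.parser import HTMLParser

TARGET = "BugID 12"
LINK = '<a href="/bug.aspx?id=12">BugID 12</a>'  # href format taken from the question


class BugLinker(HTMLParser):
    """Re-emit the markup, linking TARGET only in text outside <a> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0  # > 0 while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.anchor_depth == 0:
            data = data.replace(TARGET, LINK)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def link_unlinked(html: str) -> str:
    parser = BugLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


tricky = '<a href="/bug.aspx?id=12"><span>BugID 12</span></a> and BugID 12.'
print(link_unlinked(tricky))
```

Because the parser tracks whether it is inside an anchor, the `</span>` between the text and `</a>` no longer matters: the first occurrence is left alone and only the trailing one is wrapped.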