Regular expression to strip everything between anchor tags

Developer https://www.devze.com 2022-12-17 22:51 Source: Internet

I am trying to strip out all the links, and the text between the anchor tags, from an HTML string as below:

 string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";

 htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);

This is not working. Does anyone have an idea why?

Thanks a lot,

Edit: the regex was from this link: Extract text and links from HTML using Regular Expressions

Problems in your string: an unnecessary slash at the beginning (that's Perl syntax), an unescaped backslash (\b, which a regular C# string literal turns into a backspace character), and an unnecessarily escaped backslash (\\).
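To see these problems concretely, it can help to inspect what the compiler actually stores in each literal. The following is only a minimal sketch; the class name LiteralDemo, the constant names, and the sample HTML are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // The literal from the question. In a regular C# string, \b becomes the
    // backspace character (U+0008) and \\ becomes a single backslash.
    public const string Broken = "/<a\b[^>]*>(.*?)<\\/a>";

    // Verbatim literal: backslashes reach the regex engine unchanged.
    public const string Fixed = @"<a\b[^>]*>(.*?)</a>";

    public static void Main()
    {
        string html = "x <a href=\"#\">link</a> y"; // illustrative input

        Console.WriteLine(Broken[0] == '/');             // True: the slash is matched literally
        Console.WriteLine(Broken.IndexOf('\b') >= 0);    // True: a backspace is in the pattern
        Console.WriteLine(Regex.IsMatch(html, Broken));  // False
        Console.WriteLine(Regex.IsMatch(html, Fixed));   // True
    }
}
```

The broken pattern only matches input that contains a literal "/<a" followed by a backspace character, which ordinary HTML never does.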

So, if it has to be a regex, and bearing in mind all the warnings that other people have already linked to, try:

string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);

The \b is necessary to prevent other tags whose names start with the letter a (such as <abbr> or <address>) from matching.
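A quick way to check the effect of the word boundary is to run the pattern with and without it on the same input. This is a sketch with made-up sample HTML; the class BoundaryDemo and the helper Strip are hypothetical names used only here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Removes every match of the given pattern, ignoring case.
    public static string Strip(string html, string pattern) =>
        Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);

    public static void Main()
    {
        string html = "<abbr title=\"x\">HTML</abbr> and <a href=\"#\">a link</a>.";

        // With \b only the real anchor is removed.
        Console.WriteLine(Strip(html, @"<a\b[^>]*>(.*?)</a>"));
        // prints: <abbr title="x">HTML</abbr> and .

        // Without \b the match starts at <abbr and runs to the first </a>.
        Console.WriteLine(Strip(html, @"<a[^>]*>(.*?)</a>"));
        // prints: .
    }
}
```

Without the boundary, the pattern swallows the <abbr> element and everything up to the closing </a>, which is almost certainly not what is wanted.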

Use an HTML Parser and not Regular Expressions to parse HTML.

HTML Agility Pack

I recommend Expresso to troubleshoot regular expressions. You can find a library of regular expressions here.

You might consider using JavaScript to walk the DOM tree for your replacements instead of a regex.

string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";

htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);
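Note that this variant behaves differently from removing the whole match: the "$1$3" replacement puts the opening and closing tags back and drops only the text between them. A small sketch of that behaviour, with illustrative input and a hypothetical helper named EmptyAnchors:

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeepTagsDemo
{
    // Keeps group 1 (opening tag) and group 3 (closing tag), drops group 2.
    public static string EmptyAnchors(string html) =>
        Regex.Replace(html, @"(<a\s+[^>]*>)(.*?)(</a>)", "$1$3", RegexOptions.IgnoreCase);

    public static void Main()
    {
        Console.WriteLine(EmptyAnchors("See <a href=\"/docs\">the docs</a> now."));
        // prints: See <a href="/docs"></a> now.

        // \s+ requires whitespace after the tag name, so a bare <a> is left alone.
        Console.WriteLine(EmptyAnchors("<a>bare</a>"));
        // prints: <a>bare</a>
    }
}
```

Whether empty anchors left in the markup are acceptable depends on what the stripped HTML is used for.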

Conceptually, this only strips links of a very special kind (e.g. your regex does not match an upper-case A, which is perfectly valid in HTML: <A ...>bla</A>). The replacement wouldn't work for JavaScript links either. Is your code relevant to user security?