RegEx.Replace but exclude matches within html tags?

I have a helper method called HighlightKeywords, which I use on a forum when viewing search results to highlight, within the posts, the keyword(s) the user searched on.

The problem is this: say the user searches for the keyword 'hotmail'. HighlightKeywords finds matches of that keyword and wraps each one in a span tag specifying a style to apply, but it also finds matches inside HTML anchor tags and, in some cases, image tags. As a result, when I render the highlighted posts to screen, the HTML tags are broken (because the span has been inserted inside them).

Here is my function:

public static string HighlightKeywords(this string s, string keywords, string cssClassName)
{
    if (s == string.Empty || keywords == string.Empty)
    {
        return s;
    }

    string[] sKeywords = keywords.Split(' ');
    foreach (string sKeyword in sKeywords)
    {
        try
        {
            s = Regex.Replace(s, @"\b" + sKeyword + @"\b", string.Format("<span class=\"" + cssClassName + "\">{0}</span>", "$0"), RegexOptions.IgnoreCase);
        }
        catch {}
    }
    return s;
}

What would be the best way to prevent this from breaking? Even if I could just simply exclude any matches that occur within anchor tags (whether they be web or email addresses) or image tags?

No. You can't do that. At least, not in a way that won't break. Regular Expressions are not up to the task of parsing HTML. I am really sorry. You will want to read this rant too: RegEx match open tags except XHTML self-contained tags

So, you will probably need to parse the HTML (I hear the HtmlAgilityPack is good) and then only match inside certain portions of the document - excluding anchor tags etc.
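
For illustration, here is a minimal sketch of that parse-then-match idea. It assumes the HtmlAgilityPack NuGet package is referenced; the names HtmlHighlighter and Highlight are made up for this example. The XPath skips text that sits inside anchors, and image tags have no text nodes at all, so their attributes are never touched. The sketch relies on HtmlTextNode.Text being written out without encoding, which is worth checking against the version you use.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party NuGet package (assumed to be installed)

public static class HtmlHighlighter // hypothetical name, for illustration only
{
    public static string Highlight(string html, string keyword, string cssClassName)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrEmpty(keyword))
        {
            return html;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every text node that is not inside an <a> element.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
        if (textNodes == null)
        {
            return html;
        }

        // Escape the keyword so regex metacharacters in it are matched literally.
        var pattern = new Regex(@"\b" + Regex.Escape(keyword) + @"\b", RegexOptions.IgnoreCase);

        foreach (var node in textNodes.OfType<HtmlTextNode>())
        {
            // Only the text content is rewritten; tags and attributes are left alone.
            node.Text = pattern.Replace(
                node.Text,
                m => "<span class=\"" + cssClassName + "\">" + m.Value + "</span>");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Excluding other containers (script, style, and so on) would mean extending the XPath predicate in the same way.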

I ran into the same problem and came up with this workaround:

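    public static string HighlightKeyWords(string s, string[] KeyWords)
    {
        if (KeyWords != null && KeyWords.Length > 0 && !string.IsNullOrEmpty(s))
        {
            foreach (string word in KeyWords)
            {
                // Wrap each match in {0}/{1} placeholders rather than real tags,
                // so later keywords cannot match inside the inserted markup.
                // Regex.Escape makes the keyword match literally.
                s = System.Text.RegularExpressions.Regex.Replace(s, System.Text.RegularExpressions.Regex.Escape(word), "{0}$0{1}", System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            }

            // Swap the placeholders for the real tags in one pass.
            // Kept inside the if so a null s is never passed to string.Format.
            s = string.Format(s, "<mark class='hightlight_text_colour'>", "</mark>");
        }

        return s;
    }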

Looks kind of scary, but I delay adding the html tags until the regex has matched all the keywords. Inside the loop I insert the {0} and {1} placeholders for the beginning and end html tags instead of the tags themselves, and then swap in the real html tags at the end.

It would still break if {0} or {1} is passed in as a keyword, though, or if the text itself contains literal braces, since string.Format would try to interpret them.
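
One way around the brace problem, sketched here as an assumption rather than as part of the workaround above, is to use sentinel characters from the Unicode private-use area as placeholders and swap them with string.Replace instead of string.Format. The class name PlaceholderHighlighter and the choice of U+E000 and U+E001 are made up for this example and assume those characters never occur in posts or keywords. Like the workaround above, this only stops later keywords from matching inside the inserted mark tags; it does not protect tags already present in the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderHighlighter // hypothetical name, for illustration only
{
    // Private-use code points, assumed never to appear in post text or keywords.
    private const string Open = "\uE000";
    private const string Close = "\uE001";

    public static string HighlightKeyWords(string s, string[] keyWords)
    {
        if (string.IsNullOrEmpty(s) || keyWords == null)
        {
            return s;
        }

        foreach (string word in keyWords)
        {
            if (string.IsNullOrEmpty(word))
            {
                continue; // an empty pattern would match at every position
            }

            // Escape the keyword so it is matched literally, then wrap each
            // match in the sentinels instead of real markup.
            s = Regex.Replace(s, Regex.Escape(word), Open + "$0" + Close, RegexOptions.IgnoreCase);
        }

        // Plain string replacement: braces in the text have no special meaning here.
        return s.Replace(Open, "<mark class='hightlight_text_colour'>")
                .Replace(Close, "</mark>");
    }
}
```

With this version, a post containing {0} or a stray brace passes through unchanged apart from the highlighted keywords.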

Marcus, resurrecting this question because it had a simple solution that wasn't mentioned. This situation sounds very similar to Match (or replace) a pattern except in situations s1, s2, s3 etc.

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Taking hotmail as an example to show the technique in its simplest form, here's our simple regex:

<a.*?</a>|(hotmail)

The left side of the alternation matches complete <a ... </a> tags. We will ignore these matches. The right side matches and captures hotmail to Group 1, and we know these are the right occurrences of hotmail because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Left branch: whole <a ... </a> tags (matched, then returned unchanged).
        // Right branch: a bare "hotmail", captured in Group 1.
        var myRegex = new Regex(@"<a.*?</a>|(hotmail)");
        string s1 = @"replace this=> hotmail not that => <a href=""http://hotmail.com"">hotmail</a>";

        string replaced = myRegex.Replace(s1, delegate(Match m) {
            // Group 1 is non-empty only when the right branch matched.
            if (m.Groups[1].Value != "") return "<span something>hotmail</span>";
            else return m.Value;
        });
        Console.WriteLine("\n" + "*** Replacements ***");
        Console.WriteLine(replaced);

        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
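
To connect this back to the original HighlightKeywords helper, here is a rough sketch of the same alternation trick applied to several keywords at once. Everything beyond the method signature from the question is an assumption of this example: the class name KeywordHighlighter, the extra <[^>]*> branch that skips any single tag (which covers img), and the use of Regex.Escape on each keyword. The usual caveats about regex and HTML still apply, for instance a > inside an attribute value would confuse it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordHighlighter // hypothetical name, for illustration only
{
    public static string HighlightKeywords(this string s, string keywords, string cssClassName)
    {
        if (string.IsNullOrEmpty(s) || string.IsNullOrWhiteSpace(keywords))
        {
            return s;
        }

        // Split on spaces, drop empty entries, and escape regex metacharacters.
        string[] escaped = keywords
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape)
            .ToArray();

        // Branch 1: a whole <a ...>...</a> element (skipped).
        // Branch 2: any other single tag, such as <img ...> (skipped).
        // Branch 3: a keyword on word boundaries, captured in Group 1.
        string pattern = @"<a\b.*?</a>|<[^>]*>|\b(" + string.Join("|", escaped) + @")\b";
        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return regex.Replace(s, m =>
            m.Groups[1].Success
                // Re-emit the matched text so its original casing is preserved.
                ? "<span class=\"" + cssClassName + "\">" + m.Value + "</span>"
                : m.Value);
    }
}
```

Because all keywords go into one pattern and one pass, a later keyword can never match inside a span inserted for an earlier one.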

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...