What regular expression and C# code will strip every HTML tag except links?

Developer https://www.devze.com 2023-01-11 20:01 Source: Internet
I'm creating a CLR user-defined function in SQL Server 2005 to clean up the contents of a large number of database tables.

The task is to remove almost all tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: (1) creating a user-defined SQL Server function, and (2) creating a SQL Server script that updates all the affected tables by calling the CLR function.

For the user-defined function, and given the restricted environment, I prefer to use only native libraries. That means not using the Html Agility Pack, for example.

In JavaScript, this regular expression apparently does the job:

 <\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>

At least, according to http://www.pagecolumn.com/tool/regtest.htm

But I don't know how to translate that (especially the capturing groups) into C# code so that the captured text becomes part of the output.

For instance, if the input is <a href="http://example.com">some text</a>, how do I save "http://example.com" and "some text" as part of the output in C# while stripping every other HTML tag (and its content)?


Your regular expression is completely wrong:

<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
      ↑            ↑
      1.           2.
  1. [^>] matches any single character other than >, so <aa..., <ab..., <ac... etc. match too.
  2. The greedy (.*) overmatches. For example, consider this input:

    <a href='/one'>One</a> <a href='/two'>Two</a>
            ├───────────────────────────┤ ├─┤
                       group 1            grp2
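
The two problems above can be reproduced, and one possible fix tried, with a small sketch like the following. The class name AnchorPatterns, the method ExtractLinks, and the Tightened pattern are illustrative assumptions, not part of the original answer; the tightened pattern only handles href values wrapped in single or double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorPatterns
{
    // The pattern from the question, unchanged.
    public const string Original = @"<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>";

    // Hypothetical tightened variant (an assumption, not the answerer's pattern):
    //  - \s+ after "a" rejects <aa, <ab, <abbr, ...
    //  - the href value is matched lazily up to its own closing quote (\1)
    //  - [^>]* keeps the rest of the opening tag from running past ">"
    public const string Tightened =
        @"<\s*a\s+[^>]*?href\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<text>.*?)<\s*/\s*a\s*>";

    // Returns one "url|text" entry per anchor found by the tightened pattern.
    public static List<string> ExtractLinks(string html)
    {
        var result = new List<string>();
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, Tightened, options))
        {
            // Named groups are read by name; Groups[1] is the unnamed quote group.
            result.Add(m.Groups["url"].Value + "|" + m.Groups["text"].Value);
        }
        return result;
    }
}
```

For the input shown above, Regex.Match(input, AnchorPatterns.Original).Groups[1].Value comes back as '/one'>One</a> <a href='/two', which is the overmatch in the diagram, while AnchorPatterns.ExtractLinks(input) yields one entry per link.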
    


Not quite as bomb-proof as Jordan's, but an example using Matches instead:

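// Requires: using System; using System.Text.RegularExpressions;
// Lazy quantifiers and [^>] keep each match inside a single <a ...>...</a> element.
var pattern = @"<a\s[^>]*?href=""(?<url>[^""]*)""[^>]*>(?<name>.*?)</a>";
var matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    var groups = match.Groups;
    Console.WriteLine("{0}, {1}", groups["url"].Value, groups["name"].Value);
}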
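
To go from listing the links to actually stripping everything else, one option is Regex.Replace with a MatchEvaluator. The sketch below is only an illustration under stated assumptions: the class LinkPreservingStripper and its Strip method are hypothetical names, only quoted href values are recognised, and "stripping" is taken to mean removing the tags while keeping the text between them. In a SQL CLR assembly a method like this would normally sit behind a wrapper marked with the SqlFunction attribute; that wiring is left out here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkPreservingStripper
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // First alternative: a whole <a ... href="...">...</a> element.
    // Second alternative: any other tag, including <a> tags without href and stray </a>.
    private static readonly Regex AnchorOrTag = new Regex(
        @"<\s*a\s+[^>]*?href\s*=\s*([""'])(?<url>.*?)\1[^>]*>(?<text>.*?)<\s*/\s*a\s*>" +
        @"|<[^>]*>",
        Options);

    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    // Removes every tag except links; links are rebuilt with only their href.
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;
        return AnchorOrTag.Replace(html, Rewrite);
    }

    // Called once per match: rebuild anchors, drop every other tag.
    private static string Rewrite(Match m)
    {
        if (!m.Groups["url"].Success) return string.Empty;
        // Tags nested inside the anchor text are removed as well.
        string text = AnyTag.Replace(m.Groups["text"].Value, string.Empty);
        return "<a href=\"" + m.Groups["url"].Value + "\">" + text + "</a>";
    }
}
```

Matching the whole anchor element in one alternative means an <a> without href, or an unmatched </a>, simply falls through to the second alternative and is removed like any other tag.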


In the end, I wrote a separate .NET console program that combines HtmlAgilityPack (HAP) with querying SQL Server directly. In the program I used a naive regular expression to isolate the fragments and HAP to retrieve the href values and anchor texts. The final composition then stripped out every character other than text, numbers, and some punctuation.
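
The last step, keeping only text, numbers, and some punctuation, could look roughly like the sketch below. The exact character set used in the original program is not stated, so the whitelist here (letters, digits, whitespace, and . , : ; ! ? / -) is an assumption, as are the names TextWhitelist and Clean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextWhitelist
{
    // Assumed whitelist: Unicode letters, digits, whitespace, and . , : ; ! ? / -
    // Anything outside this set is matched and removed.
    private static readonly Regex Disallowed = new Regex(@"[^\p{L}\p{N}\s.,:;!?/\-]");

    // Returns the input with every non-whitelisted character removed.
    public static string Clean(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;
        return Disallowed.Replace(text, string.Empty);
    }
}
```

A negated character class keeps the rule in one place, so widening or narrowing the allowed punctuation only means editing the pattern.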
