Simple regex question to parse similar things in .NET?

Developer https://www.devze.com 2022-12-22 10:40 Source: Internet
Is there a way to gather all links for a specific domain from a string, including only those that appear in one of these two forms:

href="http://yahoo.com/media/news.html"

or

>http://yahoo.com/media/news.html<

So basically, links that are either prefixed by href=" and end with "

or

links that are surrounded by > and <.

I tried Regex("href=\"([^\"]*)\"></A>"), but it didn't match anything.


Try the following:

string[] inputs = { "href=\"http://yahoo.com/media/news.html\"", ">http://yahoo.com/media/news.html<" };

string pattern = @"(?:href=""|>)(?<Url>http://.+?)[<""]";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Groups["Url"].Value);
    }
}

EDIT: another approach is to use look-arounds, so that the surrounding delimiters are checked but not included in the match. This lets you use Match.Value directly instead of a named group. Try the alternate approach below.

string pattern = @"(?<=href=""|>)http://.+?(?=<|"")";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Value);
    }
}
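
Since the question asks for all links on one domain, here is a rough sketch of how the look-around pattern could be combined with Regex.Matches on a single string. The LinkFinder class, the FindLinks helper and the sample HTML are made up for illustration, and the path part [^<"]* is my own assumption about what a URL may contain, so adjust it to your input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Hypothetical helper: returns every http:// URL on the given domain that
    // appears either inside href="..." or between > and <.
    public static List<string> FindLinks(string html, string domain)
    {
        // Regex.Escape keeps the dot in "yahoo.com" from matching any character.
        string pattern = @"(?<=href=""|>)http://" + Regex.Escape(domain)
                       + @"(?:/[^<""]*)?(?=<|"")";

        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            links.Add(m.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Sample input (assumed): two yahoo.com links and one link to another domain.
        string html = "<a href=\"http://yahoo.com/media/news.html\">http://yahoo.com/media/news.html</a> "
                    + "<a href=\"http://example.com/other.html\">other</a>";

        foreach (string link in FindLinks(html, "yahoo.com"))
        {
            Console.WriteLine(link);
        }
    }
}
```

Regex.Matches returns every non-overlapping match, whereas Regex.Match stops at the first one, which is why the loops above only ever print one URL per input string.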

EDIT #2: per the request in the comments, here is a pattern that will not match URLs that contain "..." in the text.

string pattern = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

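The only change is the addition of (?!.*\.{3}), a negative look-ahead that lets the match proceed only if the text it describes is absent. Here it checks that "..." does not appear anywhere in the rest of the line after http://, so an ellipsis that comes after the closing delimiter also blocks the match. Because \.{3} already fails on any run of three or more dots, {3,} is not needed for this check.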
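
To see how the negative look-ahead behaves, here is a small sketch that runs the pattern above next to a variation of my own (not part of the original answer) that only looks for "..." up to the closing delimiter. The EllipsisFilter class, the constant names and the sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EllipsisFilter
{
    // Pattern from the answer: .* lets the look-ahead scan to the end of the line.
    public const string WholeLine = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

    // Assumed variation: [^<"]* stops the scan at the closing delimiter,
    // so only dots inside the URL itself are considered.
    public const string UrlOnly = @"(?<=href=""|>)http://(?![^<""]*\.{3}).+?(?=<|"")";

    public static void Main()
    {
        string[] inputs =
        {
            ">http://yahoo.com/media/news.html<",          // no ellipsis at all
            ">http://yahoo.com/media/ne...<",              // ellipsis inside the URL
            ">http://yahoo.com/media/news.html< more..."   // ellipsis after the URL
        };

        foreach (string input in inputs)
        {
            Console.WriteLine("{0} | whole line: {1} | URL only: {2}",
                input,
                Regex.IsMatch(input, WholeLine),
                Regex.IsMatch(input, UrlOnly));
        }
    }
}
```

The two patterns agree on the first two inputs and differ on the third: the original rejects it because of the trailing "more...", while the variation still accepts the URL.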


(href="[^"]*")|(>[^<]*<)

Starts with href=", followed by characters that are not ", ending with "

or

Starts with >, followed by characters that are not <, ending with <
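
As written, the groups in that pattern include the delimiters themselves. Here is a minimal sketch of how it might be used from C# with the parentheses moved inside the delimiters, so that only the link text is captured. The DelimitedLinks class, the Extract helper and the sample input are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimitedLinks
{
    // Same alternation as above, but the groups capture only what sits
    // between the delimiters (an adjustment, not the pattern as posted).
    public const string Pattern = @"href=""([^""]*)""|>([^<]*)<";

    // Group 1 is set for the href="..." form, group 2 for the >...< form.
    public static string Extract(Match m)
    {
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    public static void Main()
    {
        // Sample input (assumed): one anchor containing both forms.
        string input = "<a href=\"http://yahoo.com/media/news.html\">http://yahoo.com/media/news.html</a>";

        foreach (Match m in Regex.Matches(input, Pattern))
        {
            Console.WriteLine(Extract(m));
        }
    }
}
```

Note that this pattern accepts any text between the delimiters, not only URLs, so you would still need to filter the results by domain afterwards.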


Try:

href=\"(.+?)\"