Is there a way to gather all links with a specific domain from a string, including only those that appear in one of these two forms:
href="http://yahoo.com/media/news.html"
or
>http://yahoo.com/media/news.html<
So basically, links that are either prefixed by href=" and end with ", or surrounded by > and <.
I tried the regex "href=\"([^\"]*)\"></A>", but it didn't match anything.
Try the following:
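// Requires: using System; using System.Text.RegularExpressions;
string[] inputs = { "href=\"http://yahoo.com/media/news.html\"", ">http://yahoo.com/media/news.html<" };
// Match the prefix (href=" or >) without capturing it, then lazily capture the URL up to the closing " or <.
string pattern = @"(?:href=""|>)(?<Url>http://.+?)[<""]";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Groups["Url"].Value);
    }
}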
EDIT: another approach is to use look-arounds, which check the surrounding text without including it in the match. This lets you use Match.Value directly instead of reading a group. Try this alternate approach below.
string pattern = @"(?<=href=""|>)http://.+?(?=<|"")";
foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (m.Success)
    {
        Console.WriteLine(m.Value);
    }
}
EDIT #2: per the request in the comments, here is a pattern that will not match URLs that contain "..." in the text.
string pattern = @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";
The only change is the addition of (?!.*\.{3}), a negative look-ahead that lets the match proceed only if the text ahead does not contain the given pattern, in this case three consecutive dots. A run of more than three dots is rejected as well, since it contains three in a row; writing {3,} is equivalent here.
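Since the question asks for all links in a string, here is a minimal sketch that runs the pattern above through Regex.Matches instead of Regex.Match. The names LinkFilter, FindAll and ScopedPattern are hypothetical, made up for this illustration. ScopedPattern is an assumed variant, not part of the answer above, that only looks for "..." inside the URL itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFilter
{
    // Pattern from EDIT #2: the negative look-ahead uses .* and therefore
    // scans to the end of the current line, not just to the end of the URL.
    public const string Pattern =
        @"(?<=href=""|>)http://(?!.*\.{3}).+?(?=<|"")";

    // Hypothetical variant: [^<""]* stops at the closing delimiter, so only
    // dots inside the URL itself can reject the match.
    public const string ScopedPattern =
        @"(?<=href=""|>)http://(?![^<""]*\.{3})[^<""]+(?=<|"")";

    // Returns every match of the given pattern in text, in order of appearance.
    public static List<string> FindAll(string text, string pattern)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

With Pattern, an ellipsis anywhere later on the same line also suppresses the match, for example in ">http://yahoo.com/a.html< more...". If that is not wanted, something like ScopedPattern may be closer to the intent, because it limits the check to the characters before the closing " or <.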
(href="[^"]*")|(>[^<]*<)
Starts with href=", followed by characters that are not ", ending with "
or
Starts with >, followed by characters that are not <, ending with <
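As written, that pattern matches any href value and any text between > and <, and the match includes the delimiters. A rough sketch of how it could be narrowed to one domain follows; DomainLinks and Gather are hypothetical names, and restricting to http:// plus a domain prefix is an assumption based on the examples in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DomainLinks
{
    // Collects URLs for the given domain that appear either as href="..."
    // or directly between > and <. The delimiters stay outside the groups.
    public static List<string> Gather(string html, string domain)
    {
        // Escape the domain so that '.' is matched literally.
        string host = Regex.Escape(domain);
        // Group 1: URL inside href="..."; group 2: URL between > and <.
        string pattern =
            "href=\"(http://" + host + "[^\"]*)\"|>(http://" + host + "[^<]*)<";

        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return links;
    }
}
```

Calling DomainLinks.Gather(html, "yahoo.com") on a typical anchor tag returns the URL twice, once from the attribute and once from the link text, so duplicates may need to be removed afterwards. The domain is matched as a prefix only, so a host such as yahoo.com.example.net would also pass; a stricter check on the character after the domain would be needed to rule that out.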
Try:
href=\"(.+?)\"