Extracting URLs using regex in .NET

I've taken inspiration from the example shown at csharp-online and am trying to retrieve all the URLs from this Alexa page:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
            string source = client.DownloadString(url);
            //Console.WriteLine(Getvals(source));
            string matchPattern =
                    @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
            foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
            {
                foreach (DictionaryEntry DE in grouping)
                {
                    Console.WriteLine("Value = " + DE.Value);
                    Console.WriteLine("");
                }
            }
            // End.
            Console.ReadLine();
        }
        public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
        {
            ArrayList keyedMatches = new ArrayList();
            int startingElement = 1;
            if (wantInitialMatch)
            {
                startingElement = 0;
            }
            Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
            MatchCollection theMatches = RE.Matches(source);
            foreach (Match m in theMatches)
            {
                Hashtable groupings = new Hashtable();
                for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
                keyedMatches.Add(groupings);
            }
            return (keyedMatches);
        }
    }
}

But I've run into a problem: when I run this, each URL is displayed three times. First the whole anchor tag is displayed, then the URL is displayed twice. Can anyone suggest what I should change so that each URL is displayed exactly once?

Use HTML Agility Pack to parse HTML. I think it will make your problem much easier to solve.

Here's one way to do it:

// Requires the HtmlAgilityPack package and: using HtmlAgilityPack;
WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}

In your regex, you have two named groups plus the entire match. If I'm reading it correctly, you only want the URL portion of each match, which is the second of the three groups (index 1).

Instead of this:

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

don't you want this?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);

int startingElement = 1;
if (wantInitialMatch)
{
    startingElement = 0;
}

...

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

You're passing wantInitialMatch = true, so your for loop is returning:

m.Groups[0] // entire match
m.Groups[1] // (?<url>[^""^']+[.]*) href part
m.Groups[2] // (?<name>[^<]+[.]*) link text
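To print each URL once, one option is to skip the loop over every group and read only the named group "url" from each match. The following is a minimal sketch, not the original program: the pattern is a simplified stand-in for the one in the question, and the class name UrlGroupDemo, the method ExtractUrls and the sample markup are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlGroupDemo
{
    // Simplified pattern for illustration; not the exact pattern from the question.
    const string Pattern =
        @"<a\s+href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]+)</a>";

    // Returns only the text captured by the named group "url" for each match.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // m.Groups[0] would be the whole anchor tag;
            // m.Groups["url"] is just the href value.
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical sample markup standing in for the downloaded page.
        string html = "<a href=\"http://example.com/a\" class=\"offsite\">A</a> " +
                      "<a href='http://example.org/b'>B</a>";
        foreach (string u in ExtractUrls(html))
        {
            Console.WriteLine(u);
        }
    }
}
```

Looking the group up by name rather than by index keeps the code correct if more groups are later added to the pattern.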

Take a look at this: http://bouncetadiss.blogspot.com/2008/02/parsing-uri-url-in-c-and-vbnet.html
