Conducting external website searches from code

Developers https://www.devze.com 2023-03-15 20:18 Source: web
I have a csv file with lastname, firstname, and postalcode. I would like to write a .NET program to automatically search www.canada411.com for the person's postal code and lastname, and record all the results in the database.

I have no idea how to go about this, but these are the steps I need to do:

  1. Read the File (I can do this)
  2. Search www.canada411.com with the information from the file (no idea how to do this)
  3. Identify the results section of the page (no idea how to do this)
  4. For each result of the search, read the result (no idea how to do this) and store it in the database (I can do this last bit).

Can you help point me in the right direction? Many thanks in advance.


What you are referring to is screen scraping, a highly unreliable method of parsing the results of a web page into meaningful information.

You would be much better off finding a 'post code lookup service' that exposes an API for programmatically retrieving this information. This way your code isn't going to break just because the provider changes the design of their web page.

However, to achieve what you are asking, you can use WebClient or construct an HttpWebRequest. You can then parse the response to find the area of HTML you are interested in.

Example of using HttpWebRequest - http://wiki.asp.net/page.aspx/285/httpwebrequest/
Best tool for parsing html - http://htmlagilitypack.codeplex.com/
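
As a rough illustration of the WebClient route, the sketch below builds a search URL and downloads the page as a string. The `stype`, `what`, and `where` query parameters are copied from the other answers on this page rather than from any documented API, so treat them as assumptions that may no longer hold. `PageFetcher`, `BuildSearchUrl`, and `DownloadPage` are hypothetical names used only for this example.

```csharp
using System;
using System.Net;

public static class PageFetcher
{
    // Assumption: base address and parameter names are taken from the
    // other answers in this thread; the site may have changed them.
    private const string SearchUrl = "http://www.canada411.ca/search/";

    // Builds the search URL, escaping each value so that spaces, commas
    // and other reserved characters are transmitted safely.
    public static string BuildSearchUrl(string lastName, string postalCode)
    {
        return SearchUrl
            + "?stype=si"
            + "&what=" + Uri.EscapeDataString(lastName)
            + "&where=" + Uri.EscapeDataString(postalCode);
    }

    // Downloads the raw HTML of the page at the given URL.
    public static string DownloadPage(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

A call would look something like `string html = PageFetcher.DownloadPage(PageFetcher.BuildSearchUrl("Smith", "K1A 0B1"));`, after which the returned HTML still has to be parsed.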


Fun question.

1) To get the page of results for the person's name, there are a variety of ways, but I recommend using WebClient to request the URL http://www.canada411.ca/search/?stype=si&what=Smith%2C+John, substituting "Smith" and "John" with the appropriately URL-encoded values.

2) Load the returned result into an XmlReader object.

3) Using LINQ to XML, or another approach such as XPath, gather all div elements with class = "listing".

4) For each element found in step 3, use LINQ to XML or an XDocument to read the values from the node and store them in instance variables accordingly. Some parsing logic will be required.

5) Insert the new record into your database or update an existing record

6) Repeat for all listing nodes
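
Steps 2 to 4 above could be sketched roughly as follows, assuming the downloaded page happens to be well-formed XML (real-world HTML often is not, in which case `XDocument.Parse` throws and a dedicated HTML parser is the safer choice). `ListingParser` and `GetListingTexts` are hypothetical names, and the class name "listing" is the one mentioned in step 3, not something verified against the live site.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ListingParser
{
    // Returns the trimmed text content of every <div class="listing">.
    // Assumes the elements are in no XML namespace; XHTML served with an
    // xmlns declaration would need an XNamespace-qualified element name.
    public static List<string> GetListingTexts(string markup)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed.
        XDocument doc = XDocument.Parse(markup);

        return doc.Descendants("div")
                  .Where(d => (string)d.Attribute("class") == "listing")
                  .Select(d => d.Value.Trim())
                  .ToList();
    }
}
```

Each returned string would then go through whatever parsing logic is needed to split it into name and address before it is written to the database.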

If the steps above don't make sense to you, then I'm afraid there isn't a simple answer. The easiest way is to use a free, government-sponsored web service, if you can find one, and get the results back in a consistent format.

Keep in mind that any change to their page layout, class names, etc. will break your code. This is a highly unreliable way of gathering information, but it might work for an initial database load.


I was super bored so:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class FourElevenLookup
{
    private const string URL = "http://www.canada411.ca/search/";
    private const string TYPE_PARAM = "?stype=si";
    private const string WHAT_PARAM = "&what=";
    private const string WHERE_PARAM = "&where=";

    // Searches by last name and postal code and returns the parsed listings.
    public static List<SearchResult> GetResults(string lastName, string postalCode)
    {
        List<SearchResult> results = new List<SearchResult>();
        // URL-encode both values; spaces in the postal code are sent as "+".
        string fullUrl = URL + TYPE_PARAM + WHAT_PARAM + Uri.EscapeDataString(lastName) +
            WHERE_PARAM + Uri.EscapeDataString(postalCode).Replace("%20", "+");
        string rawText = GetHtml(fullUrl);
        // Captures the markup that follows each "listing" or "listingDetail"
        // HTML comment. Both patterns are tied to the page's markup and will
        // break if the site changes its layout.
        Regex getListings = new Regex("\\<\\!\\-\\- (listingDetail|listing) \\-\\-\\>(?<content>" + 
            "(.(?!(\\<\\!\\-\\- (\\/ listingDetail|listing) \\-\\-\\>)))*)", RegexOptions.Singleline);
        // Pulls the name (link text) and the address span out of one listing.
        Regex parseListing = new Regex("\\<div class=\"c411ListingInfo\"\\>(.(?!a href=))*\\<a href\\=(.(?!\\>))*\">" + 
            "(?<name>[\\w- ]*)\\<\\/a\\>\\<br\\/\\>(.(?!span))*\\<span class\\=\"address\"\\>" + 
            "(?<address>(.(?!\\/span\\>))*)", RegexOptions.Singleline);
        foreach (Match listing in getListings.Matches(rawText))
        {
            Match m = parseListing.Match(listing.Groups["content"].Value);
            if (!m.Success)
                continue; // skip blocks that don't look like a listing
            results.Add(new SearchResult()
            {
                Name = m.Groups["name"].Value,
                Address = m.Groups["address"].Value.Replace("<br/>", "")
            });
        }
        return results;
    }

    // Downloads the page at strURL and returns its body as a string.
    private static string GetHtml(string strURL)
    {
        WebRequest objRequest = WebRequest.Create(strURL);
        // Dispose the response and the reader even if reading fails.
        using (WebResponse objResponse = objRequest.GetResponse())
        using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
}

public struct SearchResult 
{
    public string Name { get; set; }
    public string Address { get; set; }
}