Download an undefined number of files with HttpWebRequest.BeginGetResponse

I have to write a small app which downloads a few thousand files. Some of these files contain references to other files that must be downloaded as part of the same process. The following code downloads the initial list of files, but I would like to download the other files as part of the same loop. What is happening here is that the loop completes before the first request comes back. Any idea how to achieve this?

var countdownLatch = new CountdownEvent(Urls.Count);

string url;
while (Urls.TryDequeue(out url))
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.BeginGetResponse(
        new AsyncCallback(ar =>
        {
            using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
            {
                using (var sr = new StreamReader(response.GetResponseStream()))
                {
                    string myFile = sr.ReadToEnd();

                    // TODO: Look for a reference to another file. If found, queue a new Url.
                }
            }
            countdownLatch.Signal(); // signal that this request has finished
        }), webRequest);
}

countdownLatch.Wait();

One solution which comes to mind is to keep track of the number of pending requests and only finish the loop once no requests are pending and the Url queue is empty:

string url;
int requestCounter = 0;
AutoResetEvent requestFinished = new AutoResetEvent(false);
while (true)
{
    // Read the pending count BEFORE looking at the queue. A callback enqueues
    // new URLs before it decrements the counter, so "0 pending and an empty
    // queue" really means there is nothing left to do.
    int pending = Interlocked.CompareExchange(ref requestCounter, 0, 0);
    if (Urls.TryDequeue(out url))
    {
        Interlocked.Increment(ref requestCounter);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.BeginGetResponse(
            new AsyncCallback(ar =>
            {
                try {
                    using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
                    {
                        using (var sr = new StreamReader(response.GetResponseStream()))
                        {
                            string myFile = sr.ReadToEnd();

                            // TODO: Look for a reference to another file. If found, queue a new Url.
                        }
                    }
                }
                finally {
                    Interlocked.Decrement(ref requestCounter);
                    requestFinished.Set();
                }
            }), webRequest);
    }
    else if (pending > 0)
    {
        // no url but requests are still pending
        requestFinished.WaitOne();
    }
    else
    {
        // nothing queued and nothing in flight
        break;
    }
}

You are trying to write a web crawler. In order to write a good web crawler, you first need to define some parameters...

1) How many requests do you want to download simultaneously? In other words, how much throughput do you want? This determines things like how many requests you keep outstanding, what the thread pool size should be, etc.

2) You will need a queue of URLs. This queue is populated by each request that completes. You now need to decide what the growth strategy of the queue is. For example, you cannot have an unbounded queue, because work items can be pumped into the queue faster than they can be downloaded from the network.

Given this, you can design a system as follows:

Have at most N worker threads that actually download from the web. Each one takes one item from the queue and downloads the data, then parses the data and adds any URLs it finds to your URL queue.
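As a rough illustration of the worker side, here is a minimal sketch. It is not from the original post: the class name CrawlerWorkers, the method Crawl and the download delegate are all made up for the example, and download stands in for "fetch the file and return the URLs it references". The queue is left unbounded here to keep the termination logic easy to follow, and there is no check for URLs that were already visited.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class CrawlerWorkers
{
    // 'download' is a hypothetical stand-in: it fetches one URL and returns
    // the URLs referenced by that file (empty if there are none).
    public static int Crawl(IEnumerable<string> seeds, int workerCount,
                            Func<string, IEnumerable<string>> download)
    {
        var queue = new BlockingCollection<string>(); // unbounded in this sketch
        int pending = 0;   // URLs queued or currently being processed
        int processed = 0;

        foreach (string seed in seeds)
        {
            pending++;
            queue.Add(seed);
        }
        if (pending == 0) queue.CompleteAdding();

        var workers = new List<Thread>();
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                // Blocks while the queue is empty; ends after CompleteAdding().
                foreach (string url in queue.GetConsumingEnumerable())
                {
                    try
                    {
                        foreach (string found in download(url))
                        {
                            Interlocked.Increment(ref pending);
                            queue.Add(found);
                        }
                    }
                    catch (Exception)
                    {
                        // Sketch only: a failed download is simply skipped.
                    }
                    finally
                    {
                        Interlocked.Increment(ref processed);
                        // Last outstanding URL is done: let every worker exit.
                        if (Interlocked.Decrement(ref pending) == 0)
                            queue.CompleteAdding();
                    }
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (Thread w in workers) w.Join();
        return processed;
    }
}
```

The pending counter is what answers the "undefined number of files" part: it counts URLs that are queued or in progress, and the workers only stop once it drops to zero.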

If there are more than 'M' URLs in the queue, then the queue blocks and does not allow any more URLs to be queued. At that point you can do one of two things: either block the thread that is enqueuing, or just discard the work item being enqueued. If you block, the blocked thread will be able to enqueue successfully once a work item completes on another thread and a URL is dequeued.
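One way to get such a bounded queue in .NET 4 is BlockingCollection<T> with a bounded capacity. The sketch below only shows the two enqueue strategies side by side; the class and method names (BoundedUrlQueue, EnqueueOrBlock, EnqueueOrDiscard) are invented for the example.

```csharp
using System;
using System.Collections.Concurrent;

public static class BoundedUrlQueue
{
    // 'capacity' is the 'M' from the text above.
    public static BlockingCollection<string> Create(int capacity)
    {
        return new BlockingCollection<string>(capacity);
    }

    // Strategy 1: block the enqueuing thread until a slot is free.
    public static void EnqueueOrBlock(BlockingCollection<string> queue, string url)
    {
        queue.Add(url);
    }

    // Strategy 2: discard the URL when the queue is full.
    // Returns false if the URL was dropped.
    public static bool EnqueueOrDiscard(BlockingCollection<string> queue, string url)
    {
        return queue.TryAdd(url);
    }
}
```

With a capacity of 2 and two URLs already queued, EnqueueOrDiscard returns false right away, while EnqueueOrBlock waits until some other thread calls queue.Take(). One thing to watch with the blocking strategy: if the same threads both dequeue and enqueue, they can all end up blocked on a full queue with nobody left to dequeue, so discarding (or enqueuing from a separate thread) is the safer choice in that setup.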

With a system like this, you can ensure that you will not run out of system resources while downloading the data.

Implementation:

Note that if you are using async, then you are using an extra I/O thread to do the download. This is fine, as long as you are mindful of this fact. You can also do a pure async implementation, where you keep 'N' BeginGetResponse() calls outstanding and, for each one that completes, you start another one. This way you will always have 'N' requests outstanding, as long as there are URLs left in the queue.
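A minimal sketch of that pure async variant could look like the following. The class OutstandingRequests and the findReferences delegate are hypothetical (findReferences stands in for whatever parsing finds the referenced files), and error handling is reduced to "skip it".

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

public sealed class OutstandingRequests
{
    private readonly ConcurrentQueue<string> urls;
    private readonly int maxOutstanding;                 // the 'N' from the text
    private readonly Func<string, IEnumerable<string>> findReferences;
    private readonly object gate = new object();
    private readonly ManualResetEvent finished = new ManualResetEvent(false);
    private int outstanding;
    private int completed;

    public OutstandingRequests(IEnumerable<string> seeds, int maxOutstanding,
                               Func<string, IEnumerable<string>> findReferences)
    {
        this.urls = new ConcurrentQueue<string>(seeds);
        this.maxOutstanding = maxOutstanding;
        this.findReferences = findReferences;
    }

    public int Completed { get { lock (gate) { return completed; } } }

    // Blocks until the queue is empty and no request is in flight.
    public void Run()
    {
        StartMore();
        finished.WaitOne();
    }

    // Tops the number of in-flight requests back up to maxOutstanding.
    private void StartMore()
    {
        var toStart = new List<string>();
        lock (gate)
        {
            string next;
            while (outstanding < maxOutstanding && urls.TryDequeue(out next))
            {
                outstanding++;
                toStart.Add(next);
            }
            if (outstanding == 0) { finished.Set(); return; }
        }
        foreach (string url in toStart) Begin(url);
    }

    private void Begin(string url)
    {
        HttpWebRequest request;
        try { request = (HttpWebRequest)WebRequest.Create(url); }
        catch (Exception) { OnCompleted(); return; }     // e.g. malformed URL
        request.BeginGetResponse(ar => Complete(request, ar), null);
    }

    private void Complete(HttpWebRequest request, IAsyncResult ar)
    {
        try
        {
            using (var response = (HttpWebResponse)request.EndGetResponse(ar))
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                foreach (string found in findReferences(reader.ReadToEnd()))
                    urls.Enqueue(found);
            }
        }
        catch (Exception) { /* sketch only: a failed download is skipped */ }
        finally { OnCompleted(); }
    }

    // Each completion frees a slot and immediately tries to start another request.
    private void OnCompleted()
    {
        lock (gate) { outstanding--; completed++; }
        StartMore();
    }
}
```

Usage would be something like new OutstandingRequests(Urls, 10, FindReferences).Run(), where 10 and FindReferences are placeholders. New URLs are enqueued before the completed request is counted down, so Run() only returns once nothing is queued and nothing is in flight.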