For starters, I'm new to Python, so my code below may not be the cleanest. For a program, I need to download about 500 web pages. The URLs are stored in an array that is populated by a previous function. The downloading part goes something like this:
    import urllib

    def downloadpages(num):
        # Save each page as webpages/<name>.htm, one request at a time
        for i in range(0, numPlanets):
            urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks.
Edit: To anyone who is interested, following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download time continues to decrease as threads are added, up to around 60 threads, after which it begins to rise again.
But you're not downloading a single file here. You're downloading 500 separate pages, and each request involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 x 20 KB is not the same as downloading a single file of that size.
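A back-of-the-envelope calculation with the numbers from the question makes the point. This is only a rough estimate: it treats "at least 10 mins" as exactly 600 seconds and looks only at the average per request.

```python
pages = 500
size_kb = 20
total_seconds = 10 * 60  # "at least 10 mins" from the question

total_mb = pages * size_kb / 1000.0          # total data, in MB
seconds_per_page = total_seconds / float(pages)  # average time per request

print("total data: %.0f MB" % total_mb)
print("time per page: %.1f s" % seconds_per_page)
```

That works out to about 1.2 seconds for each 20 KB page. If the estimate of a minute or two for a single file of the combined size holds, transferring 20 KB should take a fraction of a second, so most of that 1.2 seconds is per-request overhead.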
You can speed up execution significantly by using threads (be careful, though, not to overload the server).
Intro material/Code samples:
- http://docs.python.org/library/threading.html
- Python Package For Multi-Threaded Spider w/ Proxy Support?
- http://code.google.com/p/workerpool/wiki/MassDownloader
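As a minimal sketch of the threaded approach, assuming Python 3 (where `urllib.request` replaces the `urllib` module used in the question), the standard library's `concurrent.futures.ThreadPoolExecutor` can run the downloads in parallel. The names `fetch_url` and `download_all`, the `fetch` parameter, and the default of 20 workers are illustrative choices, not something from the question; tune the worker count so you don't overload the server.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_all(urls, names, out_dir="webpages", max_workers=20, fetch=fetch_url):
    """Fetch every URL in a thread pool and save each as <out_dir>/<name>.htm.

    Returns the list of file paths written, in the same order as urls.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs fetch concurrently but yields results in input order
        for name, body in zip(names, pool.map(fetch, urls)):
            path = os.path.join(out_dir, name + ".htm")
            with open(path, "wb") as f:
                f.write(body)
            paths.append(path)
    return paths
```

With the variables from the question, the call would look like `download_all(downloadlist, names)`. The `fetch` parameter is only there so the function can be tried out without touching the network.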
You can use greenlets for this.
For example, with the eventlet library:
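    import eventlet
    from eventlet.green import urllib2

    urls = [url1, url2, ...]

    def fetch(url):
        # The green urllib2 yields to other greenlets while waiting on the network
        return urllib2.urlopen(url).read()

    pool = eventlet.GreenPool()
    # imap runs fetch on the URLs concurrently in green threads
    for body in pool.imap(fetch, urls):
        print "got body", len(body)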
All calls in the pool will run pseudo-simultaneously.
Of course, you must first install eventlet with pip or easy_install.
There are several greenlet implementations for Python. You could do the same with gevent or another one.
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP/1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over it, instead of paying for a TCP/IP setup and teardown on every request. I'm not sure whether urllib2 does that by default; you might have to roll your own.
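A rough sketch of connection reuse, assuming Python 3 and that all the pages live on one host: `http.client.HTTPConnection` from the standard library speaks HTTP/1.1 and can send several requests over the same connection, provided each response is read in full first and the server keeps the connection alive. The function name `fetch_paths`, the `make_conn` parameter, and the host and paths in the usage note are hypothetical.

```python
import http.client


def fetch_paths(host, paths, make_conn=http.client.HTTPConnection):
    """Request several paths from one host over a single connection.

    Returns a dict mapping each path to its response body (bytes).
    """
    conn = make_conn(host)
    bodies = {}
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read completely before the next request
            # can be sent on the same connection.
            bodies[path] = response.read()
    finally:
        conn.close()
    return bodies
```

For example, `fetch_paths("example.com", ["/a.htm", "/b.htm"])` would request both pages through one connection object. In a threaded setup, each worker thread would hold its own connection and work through its share of the paths.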