I have a reasonably long list of websites that I want to download the landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?
(Here's what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)
So I have a start_urls list like
start_urls = ['google.com', 'yahoo.com', 'aol.com']
and I scrape the text from each response and store it in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.
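For concreteness, a minimal spider along these lines might look like the sketch below. The spider name and the parse body are placeholders, and the exact imports and middleware path differ between Scrapy versions:

import scrapy

class LandingPageSpider(scrapy.Spider):
    # hypothetical name, adjust to your project
    name = 'landing_pages'
    # Scrapy needs the scheme; it will not guess 'http://' for you
    start_urls = ['http://google.com', 'http://yahoo.com', 'http://aol.com']

    # With no allowed_domains set, OffsiteMiddleware lets every domain through;
    # alternatively it can be disabled in SPIDER_MIDDLEWARES (the middleware's
    # import path depends on the Scrapy version).

    def parse(self, response):
        # response.body is the raw landing page; yielding a plain dict
        # works on Scrapy >= 1.0, older versions need Item classes
        yield {'url': response.url, 'body': response.body}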
Scrapy works as expected, but seems slow (about 1,000 pages an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
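(For reference, this is the sort of thing I mean in the project's settings.py; the values are just examples, and newer Scrapy releases replace CONCURRENT_REQUESTS_PER_SPIDER with CONCURRENT_REQUESTS plus CONCURRENT_REQUESTS_PER_DOMAIN:)

# settings.py -- a sketch, values are examples rather than recommendations
CONCURRENT_REQUESTS_PER_SPIDER = 64   # older Scrapy versions
# CONCURRENT_REQUESTS = 64            # newer equivalent: total in-flight requests
# CONCURRENT_REQUESTS_PER_DOMAIN = 8  # newer per-domain cap
DOWNLOAD_DELAY = 0                    # make sure no artificial delay is configured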
If you want a way to concurrently download multiple sites with Python, you can do so with the standard library like this:
import threading
import urllib

maxthreads = 4
sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        global sites
        while sites:
            try:
                # the list may empty out between the check above and this
                # pop() when several threads are draining it, hence the except
                site = sites.pop()
            except IndexError:
                break
            print "start", site
            # save each landing page to a local file named after the site
            urllib.urlretrieve('http://' + site, site)
            print "end  ", site

# start no more worker threads than there are sites to fetch
for x in xrange(min(maxthreads, len(sites))):
    Download().start()
You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.
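For example, with httplib2 the urlretrieve call in the snippet above could be swapped for something like this (a sketch with no error handling, and the URL is just a placeholder):

import httplib2

# one GET with httplib2 (pip install httplib2); content comes back as bytes
http = httplib2.Http(timeout=10)
response, content = http.request('http://google.com')
# response.status holds the HTTP status code; write `content` to disk as needed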
I'm not clear exactly how you want the scraped text to look as XML, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better, as it handles malformed markup).
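As a rough sketch of that combination (the XML layout is just a guess at what you want, and the import assumes the bs4 package):

import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

def save_pages_as_xml(pages, path):
    # pages: iterable of (url, raw_html) pairs; the element names are arbitrary
    root = ET.Element('pages')
    for url, html in pages:
        page = ET.SubElement(root, 'page', url=url)
        # BeautifulSoup tolerates malformed markup and strips the tags for us
        page.text = BeautifulSoup(html, 'html.parser').get_text()
    ET.ElementTree(root).write(path, encoding='utf-8')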