Scrapy CrawlSpider Post-processing: Finding an Average


Let's say I have a crawl spider similar to this example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MyItem(Item):
    # Fields must be declared on an Item subclass before they can be assigned.
    id = Field()
    name = Field()
    description = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item

Let's say I wanted to get some information like the sum of the IDs from each of the pages, or the average number of characters in the description across all of the parsed pages. How would I do it?

Also, how could I get averages for a particular category?


You could use Scrapy's stats collector to build this kind of information as you go: accumulate a running count and sum for each quantity while parsing, then divide when the crawl finishes. For per-category averages, use a stats key that embeds the category name, so each category gets its own count and sum.
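
A minimal sketch of what that could look like on MySpider, assuming a reasonably recent Scrapy where the spider reaches the stats collector as self.crawler.stats (very old releases exposed a scrapy.stats singleton instead); the 'my/...' key names and the way the category is pulled from the URL are made up for illustration:

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()

        # Accumulate running counts and sums; averages are derived on close.
        stats = self.crawler.stats
        stats.inc_value('my/item_count')
        if item['id']:
            stats.inc_value('my/id_sum', int(item['id'][0]))
        if item['description']:
            stats.inc_value('my/description_chars', len(item['description'][0]))
            # Per-category stats: embed the category in the key name.
            category = response.url.split('/')[-2]  # hypothetical category extraction
            stats.inc_value('my/%s/description_chars' % category,
                            len(item['description'][0]))
            stats.inc_value('my/%s/item_count' % category)
        return item

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; turn totals into averages.
        stats = self.crawler.stats
        count = stats.get_value('my/item_count', 0)
        if count:
            stats.set_value('my/avg_id',
                            stats.get_value('my/id_sum', 0) / float(count))
            stats.set_value('my/avg_description_chars',
                            stats.get_value('my/description_chars', 0) / float(count))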

For a quick dump of all the stats gathered during a crawl, you can add STATS_DUMP = True to your settings.py; the collected stats are then logged when the spider closes.

Redis (via redis-py) is also a great option for stats collection, especially if you want the numbers to survive across runs or be shared between several spiders: its atomic INCR/INCRBY commands map naturally onto this kind of counting.
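
An item pipeline is a natural place to do that, since every scraped item flows through it. Below is a rough sketch assuming a Redis server on localhost and the redis-py package; the pipeline class and key names are invented, and you would enable it via ITEM_PIPELINES in settings.py:

import redis

class RedisStatsPipeline(object):
    """Accumulate crawl statistics in Redis with atomic increments."""

    def __init__(self):
        # Assumes a Redis server on localhost with default settings.
        self.r = redis.Redis(host='localhost', port=6379, db=0)

    def process_item(self, item, spider):
        self.r.incr('stats:item_count')
        if item.get('id'):
            # .re() returns a list of strings, hence the [0].
            self.r.incrby('stats:id_sum', int(item['id'][0]))
        if item.get('description'):
            self.r.incrby('stats:description_chars', len(item['description'][0]))
        return item

Because the counters live outside the crawl process, the averages (for example stats:id_sum divided by stats:item_count) can be computed at any time from redis-cli or another script.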
