Table of Contents
- Introduction
- The Problem
- Problem Analysis
- The Solution
- Summary
Introduction
In Scrapy it is very common to crawl pages layer by layer: start from a URL, fetch the page, extract further URLs from it, crawl those, and repeat.
This article walks through a problem in which such an iterative crawl only ever fetched the first layer of pages. A minimal sketch of what a two-level crawl normally looks like follows below.
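For readers new to this pattern, here is a minimal sketch of a two-level crawl. The spider name, domain, and callback names (ExampleSpider, example.com, parse_detail) are illustrative only and are not the spider discussed in this article: parse() handles the first level and yields new Requests whose callback handles the second level.

```python
# Minimal two-level crawl sketch (illustrative names and URLs only).
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]           # domains only, never full URLs
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        # Level 1: extract links to detail pages and follow them.
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # Level 2: extract the actual data from each detail page.
        yield {"title": response.xpath("//title/text()").extract_first()}
```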
The Problem
The spider was started with:

```
scrapy crawl enrolldata
```

Scrapy's output was as follows:

```
2018-05-06 17:23:06 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: enrolldata)
2018-05-06 17:23:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.1 (default, Apr 24 2017, 23:31:02) - [GCC 6.2.0 20161005], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.0-20-generic-x86_64-with-Debian-buster-sid
2018-05-06 17:23:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'enrolldata', 'CONCURRENT_REQUESTS': 60, 'CONCURRENT_REQUESTS_PER_IP': 60, 'DEPTH_LIMIT': 5, 'NEWSPIDER_MODULE': 'enrolldata.spiders', 'SPIDER_MODULES': ['enrolldata.spiders']}
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-06 17:23:06 [scrapy.middleware] INFO: Enabled item pipelines:
['enrolldata.pipelines.EnrolldataPipeline']
2018-05-06 17:23:06 [scrapy.core.engine] INFO: Spider opened
open spider ...........pipeline
2018-05-06 17:23:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-06 17:23:06 [py.warnings] WARNING: /home/bladestone/codebase/python36env/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://www.heao.gov.cn/ in allowed_domains.
  warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)
2018-05-06 17:23:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
end of start requests
2018-05-06 17:23:14 [scrapy.core.engine] DEBUG: Crawled (200)
```
The spider code:

```python
# -*- coding: utf-8 -*-
import scrapy
from enrolldata.items import EnrolldataItem
from scrapy.http import FormRequest


class SchoolspiderSpider(scrapy.Spider):
    name = 'enrolldata'
    cookies = {}
    allowed_domains = ['http://www.heao.gov.cn/']
    start_urls = ['http://www.heao.gov.cn/JHCX/PZ/enrollplan/SchoolList.ASPx']
    .............

    def start_requests(self):
        formdata = {}
        formdata['PagesUpDown$edtPage'] = '1'
        formdata['__EVENTTARGET'] = 'PagesUpDown$lbtnGO'
        formdata['__EVENTARGUMENT'] = ''
        formdata['__VIEWSTATE'] = '/wEPdwUKMjA1MTU4MDA1Ng9kFgICBQ9kFgICAQ8PFggeDGZDdXJyZW50UGFnZQIBHhFmVG90YWxSZWNvcmRDb3VudAK4ER4KZlBhZ2VDb3VudAKVAR4JZlBhZ2VTaXplAg9kZGSI36vb/TsBmDT8pwwx37ajH1x0og=='
        formdata['__VIEWSTATEGENERATOR'] = 'AABB4DD8'
        formdata['__EVENTVALIDATION'] = '/wEWBQLYvvTTCwK2r/yJBQK6r/CJBgLqhPDLCwLQ0r3uCMy0KhJCAT8jebTQL0eNdj7uk4L5'
        for i in range(1, 2):
            formdata['PagesUpDown$edtPage'] = str(i)
            yield FormRequest(url=self.start_urls[0],
                              headers=self.headers,
                              formdata=formdata,
                              callback=self.parse_school)
        print("end of start requests")

    def parse(self, response):
        print("parse method is invoked")
        pass

    def parse_school(self, response):
        print("parse school data.....")
        urls = response.xpath('//*[@id="SpanSchoolList"]/div/div[2]/ul/li/a/@href').extract()
        print("print out all the matched urls")
        print(urls)
        for url in urls:
            request_url = self.base_url + url
            print("request_url in major:" + request_url)
            yield scrapy.Request(request_url,
                                 headers=self.request_headers,
                                 cookies=self.cookies,
                                 callback=self.parse_major_enroll,
                                 meta=self.meta)
    ......
```
The code ran without errors and produced results for the first layer of pages, but the second-layer crawl was never executed.
Problem Analysis
The log contains no error messages: the first layer was crawled correctly, yet the second layer was never fetched, and the spider code itself looked fine.
So what was the problem? Was the second-level callback simply never executed? Adding extra log statements confirmed that this callback was indeed never invoked, even though the surrounding logging was printed normally. Could the second-level requests be getting filtered or intercepted before the callback ever ran?
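One quick way to test that suspicion is a diagnostic sketch like the following (not part of the original spider): yield the second-level request with dont_filter=True, which makes Scrapy's request filters, including OffsiteMiddleware, let it through. If the page is then crawled, a filter was dropping the request; the offsite/filtered counter in the end-of-run stats points to the same conclusion. The request_url, headers, cookies, and callback names below are the ones from the spider shown above.

```python
# Diagnostic only: bypass request filtering for the second-level request.
# If parse_major_enroll now runs, the original request was being filtered.
yield scrapy.Request(request_url,
                     headers=self.request_headers,
                     cookies=self.cookies,
                     callback=self.parse_major_enroll,
                     meta=self.meta,
                     dont_filter=True)
```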
In the end, a code review turned up the actual cause: allowed_domains was set incorrectly, so all the second-level links were silently dropped by the offsite filter.
allowed_domains must be a list of domain names, not a list of URLs.
The Solution
The domain entry has to be changed from:

```python
allowed_domains = ['http://www.heao.gov.cn/']
```

to:

```python
allowed_domains = ['heao.gov.cn']
```
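For context, a sketch of how the top of the spider looks after the change (the rest of the spider is unchanged). Because OffsiteMiddleware accepts the listed domain and all of its subdomains, requests to www.heao.gov.cn are still allowed:

```python
class SchoolspiderSpider(scrapy.Spider):
    name = 'enrolldata'
    # A bare registered domain covers the host itself and any subdomain,
    # so requests to www.heao.gov.cn are no longer filtered as offsite.
    allowed_domains = ['heao.gov.cn']
    start_urls = ['http://www.heao.gov.cn/JHCX/PZ/enrollplan/SchoolList.ASPx']
```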
After re-running the spider, the deeper layers were crawled correctly.
Summary
Scrapy is a complete framework, and many of its settings and behaviours only become second nature after you have understood and applied them in a variety of real examples; only then can you pinpoint problems like this one quickly and confidently.
The above is based on my own experience; I hope it can serve as a useful reference for others.