In my previous question, I wasn't very specific about my problem (scraping with an authenticated session with Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably have used the word crawling instead.
So, here is my code so far:
class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i['url'] = response.url
        # ... do more things
        return i
As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.
The problem is that the parse function I tried to override in order to log in now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.
Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider.) Any help would be appreciated.
Do not override the parse function in a CrawlSpider:
When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule
This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
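To make the point above concrete, here is a toy model of the dispatch. It is not Scrapy's real code: ToyCrawlSpider, GoodSpider, BrokenSpider and the (regex, callback name) rule tuples are all made up for illustration. It only shows that when the base class does its rule handling inside parse, a subclass that replaces parse silently turns the rules off.

```python
import re


class ToyCrawlSpider:
    """Simplified stand-in for CrawlSpider: parse() is where the rules are applied."""

    rules = ()  # hypothetical format: tuples of (url_regex, callback_name)

    def parse(self, url, links):
        # Default callback: pick every extracted link that matches a rule.
        followed = []
        for pattern, callback_name in self.rules:
            for link in links:
                if re.search(pattern, link):
                    followed.append((link, callback_name))
        return followed


class GoodSpider(ToyCrawlSpider):
    rules = ((r'-\w+\.html$', 'parse_item'),)


class BrokenSpider(GoodSpider):
    def parse(self, url, links):
        # Overriding parse() replaces the rule handling entirely.
        return []


links = ['http://www.example.com/red-shoes.html',
         'http://www.example.com/about']
good = GoodSpider().parse('http://www.example.com/', links)
broken = BrokenSpider().parse('http://www.example.com/', links)
print(good)    # the matching link, paired with its callback name
print(broken)  # nothing is followed
```

The rules are identical in both subclasses; only the overridden parse differs, which mirrors the symptom described in the question.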
Logging in before crawling:
In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from BaseSpider, not from CrawlSpider) and override the init_request function. This function is called when the spider is initialising, before it starts crawling.
For the spider to begin crawling, you need to call self.initialized.
You can read the code that's responsible for this here (it has helpful docstrings).
An example:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass
Saving items:
Items your Spider returns are passed along to the Pipeline which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
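As a rough sketch of what such a pipeline can look like: a pipeline is a plain class with a process_item(item, spider) method that returns the item. The class name JsonLinesPipeline, the items.jl default path and the constructor argument are made up for this example, and the open_spider/close_spider hooks are an assumption you should check against the pipeline documentation for your Scrapy version. The last lines drive the class by hand, without Scrapy, just to show the call order.

```python
import json
import os
import tempfile


class JsonLinesPipeline(object):
    """Hypothetical pipeline: write each item to a file, one JSON object per line."""

    def __init__(self, path='items.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Assumed hook: called once when the spider starts.
        self.file = open(self.path, 'w')

    def process_item(self, item, spider):
        # Called for every item the spider returns; must return the item
        # so that later pipelines still receive it.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # Assumed hook: called once when the spider finishes.
        self.file.close()


# Exercising the pipeline by hand (Scrapy would make these calls itself).
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({'url': 'http://www.example.com/red-shoes.html'}, spider=None)
pipeline.close_spider(spider=None)

with open(path) as f:
    lines = [json.loads(line) for line in f]
print(lines)
```

In a real project the class would be listed in the ITEM_PIPELINES setting so that Scrapy instantiates and calls it for you.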
If you have any problems/questions in regards to Items, don't hesitate to pop open a new question and I'll do my best to help.
In order for the above solution to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider, by changing the following in the Scrapy source code. In file scrapy/contrib/spiders/crawl.py:
- add:
from scrapy.contrib.spiders.init import InitSpider
- change
class CrawlSpider(BaseSpider)
to
class CrawlSpider(InitSpider)
Otherwise the spider wouldn't call the init_request method.
Is there an easier way?
If what you need is HTTP authentication, use the provided middleware hooks.
In settings.py:
DOWNLOADER_MIDDLEWARES = {'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300}
and in your spider class add the attributes:
http_user = "user"
http_pass = "pass"
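Note that HTTP authentication is a different mechanism from the form login in the question: the credentials travel in a request header rather than in a posted form. The sketch below uses only the standard library to show what an HTTP Basic Authorization header looks like for the user/pass pair above. The helper name build_basic_auth_header is made up for this example and is not Scrapy's own code.

```python
import base64


def build_basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header (illustrative helper)."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


header = build_basic_auth_header('user', 'pass')
print(header)  # Basic dXNlcjpwYXNz
```

If the site shows an HTML login form instead of a browser password prompt, this kind of header will not log you in and the FormRequest approach is the one you need.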
Just adding to Acorn's answer above. Using his method, my script was not parsing the start_urls after the login; it exited after a successful login in check_login_response. I could see I had the generator, though. I needed to use
return self.initialized()
and then the parse function was called.
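A toy illustration of why the return matters, using only plain Python: a callback's return value is the only thing the engine sees, so calling initialized() without returning its result throws the start requests away. Everything here (run_callback, the stand-in initialized, the two check_login functions) is a simplified model made up for this example, not Scrapy's actual engine.

```python
def run_callback(callback):
    """Toy engine step: schedule whatever the callback returns."""
    result = callback()
    if result is None:
        return []
    return list(result)


def initialized():
    # Stand-in for InitSpider.initialized(): hands back the start requests.
    return ['http://www.example.com/useful_page/',
            'http://www.example.com/another_useful_page/']


def check_login_without_return():
    initialized()  # result is discarded, so nothing gets scheduled


def check_login_with_return():
    return initialized()  # result reaches the engine


lost = run_callback(check_login_without_return)
scheduled = run_callback(check_login_with_return)
print(lost)       # []
print(scheduled)  # both start URLs
```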