Since nothing so far is working, I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and added a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()

    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
Scrapy recognizes the other spiders, but not this one. What am I doing wrong?
Thanks for your help!
Please also check the version of Scrapy. The latest version uses the "name" attribute instead of "domain_name" to uniquely identify a spider.
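A minimal sketch of the renamed attribute (a plain `object` base is used here only so the snippet runs without Scrapy installed; real code subclasses CrawlSpider):

```python
class NuSpider(object):  # in real code: class NuSpider(CrawlSpider)
    # older Scrapy releases used: domain_name = "wcase"
    name = "wcase"

print(NuSpider.name)  # the value passed to "scrapy-ctl.py crawl"
```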
Have you included the spider in the SPIDER_MODULES list in your scrapy_settings.py? The tutorial doesn't say anywhere that you should do this, but you do have to.
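For example (a hypothetical scrapy_settings.py fragment; the module path 'Nu.spiders' assumes the project package is named Nu):

```python
# scrapy_settings.py -- hypothetical fragment; adjust the path to your project
SPIDER_MODULES = ['Nu.spiders']
```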
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
- Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
- You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
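The callback in a Rule is looked up by name on the spider, so the string has to match a method you actually define. A stand-in sketch of that lookup (the Rule class here is a stub with no Scrapy dependency, not Scrapy's real Rule):

```python
class Rule:  # stub standing in for Scrapy's Rule, just to show the lookup
    def __init__(self, callback=None):
        self.callback = callback

class NuSpider:
    rules = (Rule(callback='parse_item'),)

    def parse_item(self, response):  # name must match the Rule's callback string
        return {'url': response}

spider = NuSpider()
# Scrapy resolves the callback roughly like this:
cb = getattr(spider, spider.rules[0].callback)
print(cb('http://example.com/item'))  # {'url': 'http://example.com/item'}
```

If parse_item doesn't exist, that getattr fails and the rule's matches are never handled.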
Also, here are some things that will be worth looking into:

- CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings; you'll see that this is the preferred way to override the default parse method for your starting URLs.
- NuSpider.hxs is called before it's defined.
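The second point can be seen without Scrapy at all: a class body executes as soon as the class statement runs, so referencing an undefined name like hxs there raises a NameError immediately. A minimal sketch:

```python
# Class bodies execute at definition time, so hxs must already exist here.
try:
    class Broken:
        names = hxs.select('//td[@class="altRow"]')  # hxs is not defined yet
except NameError as err:
    print('NameError:', err)
```

This is exactly why the spider never loads: the module fails while the class is being defined, before Scrapy can register it.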
I believe you have errors there. The names = hxs... line will not work, because hxs isn't defined before it's used. Try running python yourproject/spiders/domain.py to surface the errors.
You are overriding the parse method, instead of implementing a new parse_item method.