I am trying to make the SgmlLinkExtractor work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
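As I understand it, the allow patterns filter what extract_links returns; a minimal sketch of that behavior (the response variable here is assumed to be an already-fetched page):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# keep only links whose url matches one of the allow patterns
lx = SgmlLinkExtractor(allow=('/aadler/',))
for link in lx.extract_links(response):  # response: an already-fetched HtmlResponse
    print link.url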
I am only using the allow argument. So I enter:
rules = (Rule(SgmlLinkExtractor(allow=('/aadler/',)), callback='parse'),)
The initial url is 'http://www.whitecase.com/jacevedo/' and I am entering allow=('/aadler',), so I expect that '/aadler/' will get scanned as well. But instead, the spider scans the initial url and then closes:
[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)
What am I doing wrong here?
Is there anyone here who has used Scrapy successfully who can help me finish this spider?
Thank you for the help.
I include the code for the spider below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']
    rules = (Rule(SgmlLinkExtractor(allow=('/aadler/',)), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
Note: SO will not let me post more than one url, so substitute the initial url as necessary. Sorry about that.
It appears you are overriding the "parse" method. "parse" is a method that CrawlSpider uses internally to follow links.
If you check the documentation, a "Warning" is clearly written:
"When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
(See the CrawlSpider documentation for verification.)
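A minimal sketch of the fix, renaming the callback (parse_item is an assumed name, not something required by Scrapy):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']
    # a callback name other than 'parse' leaves CrawlSpider's
    # link-following logic intact
    rules = (Rule(SgmlLinkExtractor(allow=('/aadler/',)), callback='parse_item'),)

    def parse_item(self, response):
        # move the body of the original parse method here
        pass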
Use a raw string for the pattern:
allow=(r'/aadler/', ...
You are missing a comma after the first element for "rules" to be a tuple:
rules = (Rule(SgmlLinkExtractor(allow=('/careers/n.\w+', )), callback='parse', follow=True),)
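As an aside, the trailing comma is what makes a one-element tuple in Python; a quick illustration in the interpreter:

>>> type(('/aadler/'))    # no comma: just a parenthesized string
<type 'str'>
>>> type(('/aadler/',))   # trailing comma: a one-element tuple
<type 'tuple'>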