Scrapy parsing issue with malformed br tags

开发者 https://www.devze.com 2023-03-15 04:19 Source: Internet
I have an HTML file with URLs separated by br tags, e.g.:

<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>

Note that the line break tag is <br/> instead of <br />. Scrapy parses and extracts the first URL but fails to extract anything after it. If I put a space before the slash, it works fine. The HTML is malformed, but I've seen this on multiple sites, and since browsers display it correctly, I'm hoping Scrapy (or the underlying lxml / libxml2 / BeautifulSoup) can parse it correctly too.


lxml.html parses it fine. Just use that instead of the bundled HtmlXPathSelector.

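import lxml.html

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

# fromstring() wraps a multi-element fragment in a single parent element
tree = lxml.html.fromstring(bad_html)

# The anchors are direct children of that parent, so a plain 'a' path finds them
for link in tree.iterfind('a'):
    print(link.attrib['href'])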

Results in:

example.com/page1.html
example.com/page2.html
example.com/page3.html

So if you want to use this method in a CrawlSpider, you just need to write a simple (or a complex) link extractor.

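E.g.:

from urllib.parse import urljoin

import lxml.html
from scrapy.link import Link

class SimpleLinkExtractor:
    def extract_links(self, response):
        tree = lxml.html.fromstring(response.body)
        # '//a' matches anchors at any depth in a full page
        hrefs = tree.xpath('//a/@href')
        # CrawlSpider follows Link objects, so wrap each absolute URL in one
        return [Link(urljoin(response.url, href)) for href in hrefs]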

Then use it in your spider:

from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SimpleLinkExtractor(), callback='parse_item'),
    )

    # etc ...


Just use <br> tags instead of <br/> tags. In HTML5, br is a void element and needs no trailing slash.
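
If the pages come from sites you don't control, the markup can't be changed at the source, but it can be rewritten before parsing. The question notes that adding a space before the slash makes extraction work, so normalizing the tag first is one possible workaround. Below is a minimal sketch using only the standard library; `normalize_br` is a hypothetical helper name, not part of Scrapy or lxml, and it assumes the br tags carry no attributes.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants; tags with attributes
# (e.g. <br class="x">) are deliberately left alone.
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def normalize_br(html: str, replacement: str = "<br>") -> str:
    """Rewrite every plain br tag variant to one consistent form."""
    return _BR_TAG.sub(replacement, html)


if __name__ == "__main__":
    bad_html = (
        '<a href="example.com/page1.html">Site1</a><br/>\n'
        '<a href="example.com/page2.html">Site2</a><br/>\n'
        '<a href="example.com/page3.html">Site3</a><br/>\n'
    )
    # Feed the normalized text to whichever parser or selector you use.
    print(normalize_br(bad_html))
```

Pass `replacement="<br />"` if you would rather keep the self-closing form that the question reports as working. The normalized string would then go to the parser in place of the raw response body.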
