Sorry if this is a dumb question, but I have absolutely no idea how to use Scrapy. I don't want to create a standalone Scrapy crawler (or whatever); I want to incorporate it into my existing code.
What I need to do is get links from a list on the site. I just need an example to better understand it. Also, is it possible to have a for loop do something with each list item? They are ordered like
<ul>
<li>example</li>
</ul>
Thanks!
You might want to consider BeautifulSoup, which is great for parsing HTML/XML; its documentation is quite helpful as well. Getting the links would be something like:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags, then print each link's target
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']
SoupStrainer removes the need to parse the entire document when all you're after is the links.
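And since you also asked about looping over list items: a minimal sketch with the same library, assuming response is the page body fetched above. findAll and .string are standard BeautifulSoup 3 API.

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(response)
for li in soup.findAll('li'):
    # .string is the item's text when the <li> contains plain text only
    print li.string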
EDIT: Just saw that you need to use Scrapy. I'm afraid I've not used it, but try the official documentation; it looks like it has what you're after.
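For what it's worth, Scrapy's selectors can be used on their own, without writing a spider. A minimal sketch, assuming a Scrapy version where Selector(text=...) is available and that you've already fetched the HTML yourself (the sample markup here is made up):

from scrapy.selector import Selector

body = '<ul><li>example</li><li>example2</li></ul>'  # pretend this was fetched already
sel = Selector(text=body)

# Text of each <li>, as a plain list of strings
for item in sel.css('li::text').extract():
    print(item)

# href of every link on the page
for href in sel.xpath('//a/@href').extract():
    print(href)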
Maybe you don't need Scrapy if it's that simple.
cat local.html
<html><body>
<ul>
<li>example</li>
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>
then...
import urllib2
from lxml import html

page = urllib2.urlopen("file:///root/local.html")
root = html.parse(page).getroot()

# Loop over every <li> and print its text
details = root.cssselect("li")
for x in details:
    print(x.text_content())

# Grab the href attribute of every <a>
for x in root.xpath('//a/@href'):
    print(x)
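If you'd rather stick to one selector language throughout, the cssselect("li") call above has a direct XPath equivalent:

# XPath equivalent of root.cssselect("li"), using the same root as above
for text in root.xpath('//ul/li/text()'):
    print(text)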