python: get image link from html

From an HTML/RSS snippet like this

[...]<div class="..." style="..."></div><p><a href="..."
<img alt="" heightt="" src="http://link.to/image"
width="" /></a><span style="">[...]

I want to get the image src link "http://link.to/image.jpg". How can I do this in Python? Thanks.


lxml is the tool for the job.

To scrape all the images from a webpage would be as simple as this:

import lxml.html

tree = lxml.html.parse("http://example.com")
images = tree.xpath("//img/@src")

print(images)

Giving:

['/_img/iana-logo-pageheader.png', '/_img/icann-logo-micro.png']

If it was an RSS feed, you'd want to parse it with lxml.etree.
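
A minimal sketch of that, assuming a standard RSS 2.0 feed (the feed URL below is a placeholder) where each item's description contains the HTML snippet with the <img> tag:

import lxml.etree
import lxml.html

# Parse the feed with lxml.etree, then parse each item's embedded HTML
# description with lxml.html to pull out the <img> src attributes.
feed = lxml.etree.parse("http://example.com/feed.rss")
for description in feed.xpath("//item/description/text()"):
    fragment = lxml.html.fromstring(description)
    print(fragment.xpath("//img/@src"))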


Using urllib and beautifulsoup:

import urllib
from BeautifulSoup import BeautifulSoup

f = urllib.urlopen(url)
page = f.read()
f.close()          
soup = BeautifulSoup(page)
for link in soup.findAll('img'):
    print "IMAGE LINKS:", link.get('data-src') 


Perhaps you should start by reading the Regex HOWTO tutorial and the Stack Overflow FAQ entry which says that whenever you are dealing with XML (or HTML), you shouldn't use regex, but rather a good parser; in your case, BeautifulSoup is one.

Using Regex, you would do this to get the link to your image:

import re
pattern = re.compile(r'src="(http://.*\.jpg)"')
pattern.search("yourhtmlcontainingtheimagelink").group(1)


To add to svick's answer, try using the BeautifulSoup parser; it has worked for me in the past.
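
A minimal sketch of that, assuming BeautifulSoup 4 (bs4); the HTML string here just stands in for the snippet from the question:

from bs4 import BeautifulSoup

html = '<p><a href="..."><img alt="" src="http://link.to/image.jpg" width="" /></a></p>'
# "html.parser" is the built-in parser; lxml can be used instead if installed.
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    print(img.get("src"))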


Get the image src attributes using only the standard library's HTMLParser (adapted from the Tornado spider demo):

from HTMLParser import HTMLParser

def get_links(html):
    class URLSeeker(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag == 'img':
                src = dict(attrs).get('src')
                if src:
                    self.urls.append(src)

    url_seeker = URLSeeker()
    url_seeker.feed(html)
    return url_seeker.urls
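
A quick usage sketch (Python 2, to match the HTMLParser import above; the URL is only a placeholder):

import urllib2

# Fetch the page and pass the raw HTML to get_links().
html = urllib2.urlopen("http://example.com").read()
print(get_links(html))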