
Extracting a tag value in BeautifulSoup when unable to match by position or attributes

开发者 https://www.devze.com 2023-01-10 23:28 Source: Internet

I'm using BeautifulSoup to scrape a web page and I'm stuck on a small problem. Here's a snippet of HTML from the page.

<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>

Once I've got the soup, how can I find this tag and get the artist name (i.e. M.I.A.)? I cannot match the tag by its style attribute, because the same style is used in a dozen places on the page. I also can't match by position, because the span tag moves around from page to page. The artist name changes, but the structure of the label span ("Artist:") is always the same.

I would like to extract only the artist name (the M.I.A. bit).


BeautifulSoup is kind of dead, since SGMLParser is deprecated. I suggest using the lxml library instead; it even has XPath support.

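from lxml import html

text = '''
<span style="font-family: arial;">
    <span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''

doc = html.fromstring(text)
# Select the text nodes of the parent of the span whose text is 'Artist:'
parts = doc.xpath("//span/span[text()='Artist:']/../text()")
# Join the text nodes and trim the whitespace left over from the markup
print(''.join(parts).strip())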

This XPath expression means "find the span tag that is a child of another span tag and whose text is exactly 'Artist:', then grab all the text nodes of its parent tag". For the snippet above, that yields M.I.A. (plus any surrounding whitespace from the markup), as one would expect.
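If neither lxml nor BeautifulSoup is an option, the same idea (match on the label text rather than on position or attributes) can be sketched with the standard library's html.parser. This is only an illustration, not the approach from the answer above. The names LabelValueParser and extract_after_label are made up for this example, and the sketch assumes the value is the text node that directly follows the label span, as in the snippet from the question.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Grab the text node that immediately follows <span>LABEL</span>.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.span_text = None   # text of the innermost open <span>, else None
        self.capture = False    # True right after the label span closes
        self.value = None       # first matching value, if any

    def handle_starttag(self, tag, attrs):
        # Any tag that opens cancels a pending capture: the value must be
        # the text directly after the label span.
        self.capture = False
        self.span_text = "" if tag == "span" else None

    def handle_endtag(self, tag):
        # Arm the capture only when the span that just closed held the label.
        self.capture = (
            tag == "span"
            and self.span_text is not None
            and self.span_text.strip() == self.label
        )
        self.span_text = None

    def handle_data(self, data):
        if self.capture and self.value is None:
            self.value = data.strip()
            self.capture = False
        elif self.span_text is not None:
            self.span_text += data


def extract_after_label(html_text, label):
    """Return the text following the span whose text equals `label`, or None."""
    parser = LabelValueParser(label)
    parser.feed(html_text)
    parser.close()
    return parser.value


sample = (
    '<span style="font-family: arial;">'
    '<span style="font-weight: bold;">Artist:</span> M.I.A.<br>\n'
    '</span>'
)
print(extract_after_label(sample, "Artist:"))  # M.I.A.
```

Because the match is on the label text alone, it does not matter how many other spans share the same style attribute or where the block sits on the page.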
