Extracting a tag value in BeautifulSoup when unable to match by position or attributes

Developer (https://www.devze.com), 2023-01-10 23:28, source: web

I'm using BeautifulSoup (BS) to scrape a web page and I'm a little stuck on a small problem. Here's a snippet of HTML from the page.

<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>

Once I've got the soup, how can I find this tag and get the artist name, i.e. M.I.A.? I can't match the tag by its style attribute, since that style is used in a dozen places on the page. I also don't know the exact location of the span tag, as it changes position from page to page, so I can't match by position either. The artist name changes, but the title span structure is always the same.

I would only like to extract the artist name (the M.I.A. bit).

BeautifulSoup 3 is kind of dead, since the SGMLParser it is built on is deprecated. I suggest you use the lxml library instead. It even has XPath support!

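from lxml import html

text = '''
<span style="font-family: arial;">
    <span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''

doc = html.fromstring(text)
# Collect the text nodes of the span that contains the bold "Artist:" label,
# then strip the whitespace left over from the markup's line breaks.
artist = ''.join(doc.xpath("//span/span[text()='Artist:']/../text()")).strip()
print(artist)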

This XPath expression means "find the span tag that sits inside another span tag and whose text is exactly 'Artist:', then grab all the text nodes of its parent tag". It prints M.I.A., as one would expect.
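If you would rather stay with BeautifulSoup, the same idea (match the label by its text, then read what follows it) can probably be expressed there too. The following is only a minimal sketch, assuming BeautifulSoup 4 (the bs4 package, version 4.4 or later for the string argument) and the stdlib html.parser backend. The helper name extract_artist and the variable html_text are made up for illustration, and the sketch assumes the artist name is always the text node directly after the "Artist:" span, as in the snippet from the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html_text = '''
<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>
'''


def extract_artist(markup):
    """Return the text following the 'Artist:' label, or None if not found.

    Hypothetical helper: assumes the name is the text node that comes
    immediately after the label span.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Match the label span by its text rather than by style or position.
    label = soup.find("span", string="Artist:")
    if label is None:
        return None
    # The node right after the label span should be the artist name.
    sibling = label.next_sibling
    if isinstance(sibling, str):
        return sibling.strip()
    return None


print(extract_artist(html_text))
```

Because the lookup keys on the label text, it should not matter where the span sits on the page or how many other tags share the same style attribute. If the label is missing, or is not followed directly by text, the helper returns None instead of raising.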