I'm trying to use BeautifulSoup on the following:
<h4>Hello<br /></h4>
<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>
For this example, let's say I have the <h4> tag saved in the variable tag. When I type print tag.text, the output is Hello, as expected.
However, when I use print tag.nextSibling, the output is blank. When I type print tag.nextSibling.nextSibling, the output is <p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>. What is going on? Why do I have to double up on .nextSibling to get to the <p> tag in my example? This happens consistently.
Apparently, .nextSibling also returns whitespace text nodes, not just tags. In the actual page I'm working with, there is whitespace (a newline) between the <h4> and <p> tags, which is why I have to call .nextSibling twice.
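The behavior can be reproduced with a small sketch. It assumes the newer bs4 package (where the attribute is spelled next_sibling) and Python's built-in "html.parser", neither of which the question itself uses. The html string is the snippet from the question with an explicit newline between the two tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# The markup from the question, with the newline between the tags made explicit.
html = (
    '<h4>Hello<br /></h4>\n'
    '<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>'
)

# "html.parser" is an assumption; other parsers may treat whitespace differently.
soup = BeautifulSoup(html, "html.parser")
tag = soup.h4

first = tag.next_sibling      # the whitespace text node between the tags
second = first.next_sibling   # the <p> tag

print(type(first).__name__, repr(str(first)))
print(type(second).__name__, second.name)
```

Printing the repr of the first sibling makes the otherwise invisible newline show up, which is why a plain print appears to output nothing.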
Evidence
Writing:
print tag.__class__
print tag.nextSibling.__class__
print tag.nextSibling.nextSibling.__class__
Yields:
<class 'BeautifulSoup.Tag'>
<class 'BeautifulSoup.NavigableString'>
<class 'BeautifulSoup.Tag'>
Here is what is written in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down
In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
You might think that the .next_sibling of the first tag would be the second tag. But actually, it’s a string: the comma and newline that separate the first tag from the second:
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# u',\n'
The second tag is actually the .next_sibling of the comma:
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
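If only the next tag matters, one way to avoid chaining two sibling lookups is bs4's find_next_sibling(), which searches the following siblings for a matching tag and so passes over the whitespace string. The sketch below is an illustration under assumptions: it uses bs4 rather than the BeautifulSoup 3 module shown in the question, the "html.parser" backend, and the question's own markup.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, including the newline between the tags.
html = (
    '<h4>Hello<br /></h4>\n'
    '<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.h4

# Look for the next sibling that is a <p> tag; the whitespace
# text node in between does not match and is skipped.
paragraph = tag.find_next_sibling("p")

print(paragraph.name)
print(paragraph.get_text())
```

This keeps the code independent of how much whitespace the page happens to contain between the two tags.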