
using nextSibling from BeautifulSoup outputs nothing

Developer, https://www.devze.com, 2023-02-26 08:43. Source: the web

I'm trying to use BeautifulSoup on the following:

<h4>Hello<br /></h4>
<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>

For this example, let's say I have the <h4> tag saved in the variable tag. When I type print tag.text the output is Hello, as expected.

However, when I use print tag.nextSibling, nothing is printed. When I type print tag.nextSibling.nextSibling, the output is <p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>. What is going on? Why do I have to use .nextSibling twice to get to the <p> tag in my example? This happens consistently.


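Apparently, .nextSibling also returns whitespace-only text nodes. In the actual page I'm working with, there is whitespace (a newline) between the <h4> and <p> tags, so the first .nextSibling is that whitespace string and the second one is the <p> tag. That is why I have to call it twice.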

Evidence

Writing:

print tag.__class__
print tag.nextSibling.__class__
print tag.nextSibling.nextSibling.__class__

Yields:

<class 'BeautifulSoup.Tag'>
<class 'BeautifulSoup.NavigableString'>
<class 'BeautifulSoup.Tag'>
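The same check can be reproduced on Python 3 with the bs4 package, where the attribute is spelled .next_sibling. This is only a sketch: it assumes bs4 is installed, uses the built-in "html.parser" backend, and the html string is the snippet from the question with a newline between the two tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Markup from the question; note the newline between </h4> and <p>.
html = (
    '<h4>Hello<br /></h4>\n'
    '<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.h4

first = tag.next_sibling      # the whitespace between the two tags
second = first.next_sibling   # the <p> tag

print(repr(first))            # '\n'
print(type(first).__name__)   # NavigableString
print(type(second).__name__)  # Tag
```

Printing the whitespace string directly shows only a blank line, which is why the first print in the question appears to output nothing; repr() makes it visible.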


Here is what is written in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first tag would be the second tag. But actually, it’s a string: the comma and newline that separate the first tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'

The second tag is actually the .next_sibling of the comma:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
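To avoid chaining .next_sibling twice, one option in bs4 is find_next_sibling, which searches the following siblings for a matching tag and so steps over whitespace strings. The sketch below assumes bs4 is installed and reuses the markup from the question; next_tag_sibling is a hypothetical helper written for illustration, not part of the library.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<h4>Hello<br /></h4>\n'
    '<p><img src="http://url.goes.here" alt="hiya" class="img" />May 28, 1996</p>'
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.h4

# Ask directly for the next sibling that is a <p> tag.
p = tag.find_next_sibling("p")
print(p.get_text())  # May 28, 1996


def next_tag_sibling(node):
    """Hypothetical helper: walk .next_sibling until a Tag (or None) is reached."""
    sibling = node.next_sibling
    while sibling is not None and not isinstance(sibling, Tag):
        sibling = sibling.next_sibling
    return sibling


print(next_tag_sibling(tag) is p)  # True
```

The manual loop is equivalent here, but find_next_sibling is shorter and also accepts a tag name or attribute filters.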