
selfClosingTags in BeautifulSoup

Source: 开发者, https://www.devze.com, 2022-12-19 03:10 (reposted from the web)

I'm using BeautifulSoup to parse my XML:

import BeautifulSoup

soup = BeautifulSoup.BeautifulStoneSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])

print soup.prettify()

This will output:

<alan x="y">
 <anne>
  hello
 </anne>
</alan>

That is, the anne tag ends up as a child of the alan tag.

If I pass selfClosingTags=['alan'] when I create the soup, I get:

<alan x="y" />
<anne>
 hello
</anne>

Great!

My question: why can't the presence of the /> be used to indicate a self-closing tag?


You are asking what was in the mind of an author, after having noted that he gives names like Beautiful[Stone]Soup to classes/modules :-)

Here are two more examples of the behaviour of BeautifulStoneSoup:

>>> soup = BeautifulSoup.BeautifulStoneSoup(
    """<alan x="y" ><anne>hello</anne>"""
    )
>>> print soup.prettify()
<alan x="y">
 <anne>
  hello
 </anne>
</alan>

>>> soup = BeautifulSoup.BeautifulStoneSoup(
    """<alan x="y" ><anne>hello</anne>""",
    selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y" />
<anne>
 hello
</anne>
>>>

My take: a self-closing tag is not legal unless it has been defined to the parser. So the author had to choose how to handle an illegal fragment like <alan x="y" />: (1) assume that the / was a mistake; (2) treat alan as a self-closing tag, quite independently of how it might be used elsewhere in the input; or (3) make two passes over the input, working out in the first pass how each tag was used. Which choice do you prefer?
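
For what it's worth, choice (3) can be approximated outside the parser with a quick pre-scan of the raw markup. The sketch below is only an illustration under stated assumptions: find_self_closing_tags is a hypothetical helper, not part of BeautifulSoup, and the regular expression is deliberately naive (it assumes no attribute value contains a literal < or >).

```python
import re

# Hypothetical helper, not a BeautifulSoup API: a naive first pass that
# collects the names of tags written in the empty-element form <name ... />.
# Assumption: attribute values never contain a literal '<' or '>'.
_EMPTY_TAG = re.compile(r"<([A-Za-z_][\w:.\-]*)(?:\s[^<>]*?)?\s*/>")

def find_self_closing_tags(markup):
    """Return the sorted, de-duplicated names of tags that end in '/>'."""
    return sorted(set(_EMPTY_TAG.findall(markup)))

markup = """<alan x="y" /><anne>hello</anne>"""
tags = find_self_closing_tags(markup)
print(tags)  # ['alan']
```

The resulting list could then be passed as selfClosingTags=tags when creating the BeautifulStoneSoup, exactly as selfClosingTags=['alan'] is passed in the question.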


I don't have a "why", but this might be of interest to you. If you use BeautifulSoup (no Stone) to parse XML with a self-closing tag, it works. Sort of:

>>> soup = BeautifulSoup.BeautifulSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y">
</alan>
<anne>
 hello
</anne>

The nesting is right, even if alan is rendered as a start and an end tag.
