selfClosingTags in BeautifulSoup

Developer https://www.devze.com 2022-12-19 03:10 Source: the web


Related topics: python xml

Using BeautifulSoup to parse my XML

import BeautifulSoup

soup = BeautifulSoup.BeautifulStoneSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])

print soup.prettify()

This will output:

<alan x="y">
 <anne>
  hello
 </anne>
</alan>

i.e., the anne tag is a child of the alan tag.

If I pass selfClosingTags=['alan'] when I create the soup, I get:

<alan x="y" />
<anne>
 hello
</anne>

Great!

My question: why can't the presence of the /> be used to indicate a self-closing tag?

You are asking what was in the mind of an author, after having noted that he gives names like Beautiful[Stone]Soup to classes/modules :-)

Here are two more examples of the behaviour of BeautifulStoneSoup:

>>> soup = BeautifulSoup.BeautifulStoneSoup(
    """<alan x="y" ><anne>hello</anne>"""
    )
>>> print soup.prettify()
<alan x="y">
 <anne>
  hello
 </anne>
</alan>

>>> soup = BeautifulSoup.BeautifulStoneSoup(
    """<alan x="y" ><anne>hello</anne>""",
    selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y" />
<anne>
 hello
</anne>
>>>

My take: a self-closing tag is not legal if it is not defined to the parser. So the author had choices when deciding how to handle an illegal fragment like <alan x="y" /> ... (1) assume that the / was a mistake; (2) treat alan as a self-closing tag quite independently of how it might be used elsewhere in the input; (3) make 2 passes over the input, working out in the first pass how each tag was used. Which choice do you prefer?
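For what it's worth, the first pass of choice (3) could be sketched roughly like this with nothing but the standard library. The name find_self_closing_tags and the regular expression are my own invention, not part of BeautifulSoup, and the regex is deliberately naive: it assumes attribute values never contain < or >, and it ignores comments and CDATA sections.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches tags written in the empty-element form, e.g. <alan x="y" /> or <br/>.
# Assumption: attribute values contain no '<' or '>' characters.
SELF_CLOSING_RE = re.compile(r"<([A-Za-z_][\w.:-]*)(?:\s[^<>]*)?/>")


def find_self_closing_tags(markup):
    """Return the names of tags that appear with a trailing '/>', in first-seen order."""
    names = []
    for name in SELF_CLOSING_RE.findall(markup):
        if name not in names:
            names.append(name)
    return names


with_slash = find_self_closing_tags('<alan x="y" /><anne>hello</anne>')
without_slash = find_self_closing_tags('<alan x="y" ><anne>hello</anne>')
```

The resulting list is the sort of thing you could then hand to the parser as selfClosingTags= on the second pass, at the cost of reading the input twice.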

I don't have a "why", but this might be of interest to you. If you use BeautifulSoup (no Stone) to parse XML with a self-closing tag, it works. Sort of:

>>> soup = BeautifulSoup.BeautifulSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y">
</alan>
<anne>
 hello
</anne>

The nesting is right, even if alan is rendered as a start and an end tag.
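For comparison, here is a small sketch of how a strict XML parser from the standard library (xml.etree.ElementTree) treats the same input. The <root> wrapper is my own addition, since a well-formed XML document needs a single root element; it is not in the original fragment.

```python
import xml.etree.ElementTree as ET

fragment = '<alan x="y" /><anne>hello</anne>'

# Assumption: wrap the fragment in a hypothetical <root> element so that
# it becomes a well-formed document with exactly one top-level element.
root = ET.fromstring("<root>" + fragment + "</root>")

# Names of the direct children of the wrapper, in document order.
child_tags = [child.tag for child in root]

alan = root.find("alan")
anne = root.find("anne")
```

Here alan and anne come out as siblings and alan has no children, so the /> is honoured without the tag having to be declared in advance.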