Using BeautifulSoup to parse my XML
import BeautifulSoup
soup = BeautifulSoup.BeautifulStoneSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
print soup.prettify()
This will output:
<alan x="y">
<anne>
hello
</anne>
</alan>
ie, the anne tag is a child of the alan tag.
If I pass selfClosingTags=['alan'] when I create the soup, I get:
<alan x="y" />
<anne>
hello
</anne>
Great!
My question: why can't the presen开发者_Go百科ce of the />
be used to indicate a self closing tag?
You are asking what was in the mind of an author, after having noted that he gives names like Beautiful[Stone]Soup to classes/modules :-)
Here are two more examples of the behaviour of BeautifulStoneSoup:
>>> soup = BeautifulSoup.BeautifulStoneSoup(
"""<alan x="y" ><anne>hello</anne>"""
)
>>> print soup.prettify()
<alan x="y">
<anne>
hello
</anne>
</alan>
>>> soup = BeautifulSoup.BeautifulStoneSoup(
"""<alan x="y" ><anne>hello</anne>""",
selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y" />
<anne>
hello
</anne>
>>>
My take: a self-closing tag is not legal if it is not defined to the parser. So the author had choices when deciding how to handle an illegal fragment like <alan x="y" />
... (1) assume that the /
was a mistake (2) treat alan
as a self-closing tag quite independently of how it might be used elsewhere in the input (3) make 2 passes over the input nutting out in the first pass how each tag was used. Which choice do you prefer?
I don't have a "why", but this might be of interest to you. If you use BeautifulSoup
(no Stone) to parse XML with a self-closing tag, it works. Sort of:
>>> soup = BeautifulSoup.BeautifulSoup( """<alan x="y" /><anne>hello</anne>""" ) # selfClosingTags=['alan'])
>>> print soup.prettify()
<alan x="y">
</alan>
<anne>
hello
</anne>
The nesting is right, even if alan
is rendered as a start and an end tag.
精彩评论