I am trying to use lxml.html to write a cleanup routine that removes empty DIV elements. While debugging, I noticed that a plain fromstring() -> tostring() round trip already modifies my HTML. First, it replaces the outer body tag with a div, and second, it changes the DIV structure.
Why?
(Pdb) from lxml.html import fromstring, tostring
(Pdb) print html
<body>
<div></div>
<p>hello world</p>
<div> </div>
<p><div> </div></p>
</body>
(Pdb) print tostring(fromstring(html))
<div>
<div></div>
<p>hello world</p>
<div> </div>
<p></p><div> </div>
</div>
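That's expected behavior. Your example is well-formed, but it is not valid HTML, so lxml.html corrects it as best it can. In particular, a div element cannot be nested inside a p element, so the parser closes the p before the div begins. The outer body becomes a div because lxml.html.fromstring() treats input that is not a complete document as a fragment and does not return body as the root. If your markup is well-formed XML, use the etree module instead. It parses strictly and keeps the structure exactly as written, but it raises an error on malformed input rather than repairing it: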
from lxml.etree import fromstring, tostring
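As a rough sketch of how that could look for the markup in the question (Python 3 syntax; the helper name `remove_empty_divs` is made up for illustration, and "empty" is assumed here to mean no child elements and only whitespace text):

```python
from lxml import etree

html = """<body>
<div></div>
<p>hello world</p>
<div> </div>
<p><div> </div></p>
</body>"""

# etree.fromstring() is a strict XML parser: it raises XMLSyntaxError on
# malformed input instead of repairing it, so nothing gets rearranged.
root = etree.fromstring(html)
print(root.tag)                        # the outer element is still body
print([child.tag for child in root])   # the nested div stays inside the last p


def remove_empty_divs(tree):
    """Remove div elements that have no children and no non-whitespace text."""
    # Collect first so the tree is not modified while iterating over it.
    empty = [el for el in tree.iter("div")
             if len(el) == 0 and not (el.text or "").strip()]
    for el in empty:
        parent = el.getparent()
        if parent is not None:      # never remove the root element itself
            parent.remove(el)       # note: this also discards el.tail


remove_empty_divs(root)
print(etree.tostring(root, encoding="unicode"))
```

This is a single pass, so a div that only becomes empty once its children are gone would need a second call. Also note that parent.remove() drops any text following the removed element (its tail), which is only whitespace in this example.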