fromstring() -> tostring() modifies the overall HTML structure

Developer https://www.devze.com 2023-03-30 07:24 Source: Internet
I am trying to use lxml.html to write a cleanup routine that removes empty DIV elements. While debugging, I noticed that a plain fromstring() -> tostring() round trip modifies my HTML. First, it removes the outer body tag (a div appears in its place), and second, it changes the DIV structure.

Why?

(Pdb) from lxml.html import fromstring, tostring
(Pdb) print html

<body>
<div></div>
<p>hello world</p>
<div>   </div>
<p><div> </div></p>
</body>

(Pdb) print tostring(fromstring(html))
<div>
<div></div>
<p>hello world</p>
<div>   </div>
<p></p><div> </div>
</div>


That's expected. Your example is well-formed, but it is not valid HTML, so lxml.html tries to correct it as best it can. In particular, a div element can't be nested inside a p element, and the root tag cannot be body. Use the etree module instead:

from lxml.etree import fromstring, tostring
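
As a minimal sketch of that suggestion (Python 3, assuming the input is well-formed XML, which real-world HTML often is not), the markup from the question can be parsed with lxml.etree and the empty div elements removed. The helper name remove_empty_divs is hypothetical and not part of lxml; "empty" is taken here to mean no child elements and only whitespace text.

```python
from lxml import etree

html = """<body>
<div></div>
<p>hello world</p>
<div>   </div>
<p><div> </div></p>
</body>"""


def remove_empty_divs(root):
    """Remove div elements with no children and only whitespace text.

    Hypothetical helper for illustration; not an lxml function.
    """
    # Collect first, then remove, so the tree is not modified while iterating.
    empty = [
        el for el in root.iter("div")
        if len(el) == 0 and not (el.text or "").strip()
    ]
    for el in empty:
        parent = el.getparent()
        if parent is not None:
            # Note: remove() also drops the element's tail text.
            parent.remove(el)
    return root


# The XML parser keeps body as the root and leaves the div inside the p.
root = etree.fromstring(html)
tags_before = [child.tag for child in root]
nested_before = root[3][0].tag

remove_empty_divs(root)
print(etree.tostring(root, encoding="unicode"))
```

The XML parser does no HTML-specific repair, so the structure survives the round trip. The trade-off is that it raises an error on markup that is not well-formed, and the default XML serialization writes empty elements in self-closing form.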