Python keeping newlines in lxml.html after cssselect and text_content()

Developer https://www.devze.com 2023-01-26 01:06 Source: Internet
In Python, how do I preserve paragraphs (i.e. keep newlines) when using lxml.html?

For example, the following will strip <p></p> tags and join the lines, which is not what I want:

body = doc.cssselect("div.body")[0]
content = body.text_content()

Here's what I've tried that doesn't work:

  • lxml.html.clean.clean_html:
    • Won't preserve the newlines.
  • content.replace(" "*3,"\n\n"):
    • Doesn't work consistently, because the combined text does not always have the same number of spaces between paragraphs.


lxml's text_content() is doing what it is supposed to do according to the docs: it strips the HTML tags and leaves the text behind.
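A small illustration of that behaviour, as a sketch only: the two-paragraph fragment below is hypothetical and stands in for the real page, which isn't shown in the question.

```python
import lxml.html

# Hypothetical fragment, used only to show how text_content() behaves.
fragment = lxml.html.fromstring("<div><p>one</p><p>two</p></div>")

# text_content() concatenates every text node and drops the tags,
# so nothing is left to mark the paragraph boundary.
joined = fragment.text_content()
print(repr(joined))
```

The two paragraphs come back as one run of text, which is why replacing runs of spaces afterwards can't reliably recover the breaks.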

You can fix this up by adding your own newlines before outputting the content.

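body = doc.cssselect("div.body")[0]
# ".//p" matches every <p> under body, including direct children
for para in body.xpath(".//p"):
    # .text is None when the paragraph starts with a child element
    para.text = "\n%s" % (para.text or "")
    # .tail is the text after </p>; put the closing newline there
    para.tail = "\n%s" % (para.tail or "")
content = body.text_content()
print(content)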
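For anyone who wants to try this end to end, here is a self-contained sketch. The sample markup and the helper name text_with_paragraph_breaks are made up for illustration, and the div is selected with an XPath expression rather than cssselect so the snippet only needs lxml itself; it assumes the class attribute is exactly "body".

```python
import lxml.html

# Hypothetical sample markup; the real page structure is not shown in the thread.
SAMPLE = (
    '<html><body><div class="body">'
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</div></body></html>"
)


def text_with_paragraph_breaks(body):
    """Return body's text with a newline before and after each <p>.

    Note: this modifies the parsed tree in place.
    """
    for para in body.xpath(".//p"):
        # .text is None when the paragraph starts with a child element.
        para.text = "\n" + (para.text or "")
        # .tail is the text that follows </p>; it is None when there is none.
        para.tail = "\n" + (para.tail or "")
    return body.text_content()


doc = lxml.html.fromstring(SAMPLE)
# Stand-in for doc.cssselect("div.body")[0]; matches class="body" exactly.
body = doc.xpath('//div[@class="body"]')[0]
content = text_with_paragraph_breaks(body)
print(content)
```

Putting the trailing newline in the paragraph's tail rather than its text keeps it after any inline children such as <b>, so the break lands at the real end of the paragraph. Calling content.strip().split("\n\n") afterwards gives one string per paragraph.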