How do I generate a table of contents for HTML text in Python?

Developer https://www.devze.com 2022-12-19 02:30 Source: Internet
Assume that I have some HTML code, like this (generated from Markdown or Textile or something):

<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>
<!-- and so on -->

How could I generate a table of contents for it using Python?


Use an HTML parser such as lxml or BeautifulSoup to find all header elements.
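
If installing a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not a replacement for lxml or BeautifulSoup: the names HeaderCollector and find_headers are made up for this example, and the sketch assumes every header has a closing tag.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Hypothetical helper: collect (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headers = []    # finished (level, text) pairs, in document order
        self._level = None   # level of the header currently open, if any
        self._parts = []     # text fragments seen inside that header

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in HEADER_TAGS:
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        # Also captures text inside nested inline tags such as <em>.
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._level is not None:
            self.headers.append((self._level, "".join(self._parts).strip()))
            self._level = None


def find_headers(html_text):
    """Return a list of (level, text) pairs for all headers in html_text."""
    collector = HeaderCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headers


sample = """<h1>A header</h1>
<p>Foo</p>
<h2>Another header</h2>
<p>More content</p>
<h2>Different header</h2>
<h1>Another toplevel header</h1>"""

# Print an indented outline, two spaces per header level.
for level, text in find_headers(sample):
    print("  " * (level - 1) + text)
```

A header that is never closed is silently skipped by this sketch, which is one reason a forgiving parser such as lxml or BeautifulSoup is usually the better choice for real-world HTML.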


Here's an example using lxml and XPath.

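from lxml import etree

# Use the HTML parser: an HTML fragment is usually not well-formed XML.
doc = etree.parse("test.xml", parser=etree.HTMLParser())
for node in doc.xpath('//h1|//h2|//h3|//h4|//h5|//h6'):
    # node.text holds only the text before any nested child element.
    print(node.tag, node.text)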
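
Finding the headers is only half of a table of contents; the other half is turning them into a nested list. The following is a rough sketch of one way to do that, assuming the headers are already available as (level, text) pairs. The function names slugify and build_toc and the slug scheme are assumptions made for this example, and the links only work if the headers in the document are given matching id attributes.

```python
import html
import re


def slugify(text):
    """Turn header text into an id-friendly slug (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


def build_toc(headers):
    """Render (level, text) pairs as nested <ul> lists of #slug links."""
    pieces = []
    stack = []  # header levels of the currently open <ul> lists
    for level, text in headers:
        # Close lists that are deeper than the current header.
        while stack and stack[-1] > level:
            pieces.append("</li></ul>")
            stack.pop()
        if stack and stack[-1] == level:
            # Same level: close the previous sibling item.
            pieces.append("</li>")
        else:
            # Deeper level (or first header): open a new list.
            pieces.append("<ul>")
            stack.append(level)
        pieces.append(
            '<li><a href="#%s">%s</a>' % (slugify(text), html.escape(text))
        )
    # Close whatever is still open at the end of the document.
    while stack:
        pieces.append("</li></ul>")
        stack.pop()
    return "".join(pieces)


headers = [
    (1, "A header"),
    (2, "Another header"),
    (2, "Different header"),
    (1, "Another toplevel header"),
]
print(build_toc(headers))
```

Two headers with the same text get the same slug here, so a real implementation would need to make the ids unique, for example by appending a counter.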