Processing an HTML file using Python

Developer (https://www.devze.com), 2023-04-11 19:44. Source: Internet

Related topics: python regex

I want to remove all the tags from an HTML file, and I am using Python's re module to do it. For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). The result is an empty string, because the regexp matches from the first opening angle bracket to the last closing one and removes everything in between. How can I get around this issue?

You can make the match non-greedy: '<.*?>'
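As a minimal sketch of the difference, the snippet below runs both patterns on the line from the question. The variable names are illustrative only.

```python
import re

line = "<h1>Hello World!</h1>"

# Greedy: '.*' runs from the first '<' to the last '>', so the whole line is removed.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: '.*?' stops at the nearest '>', so each tag is removed on its own.
non_greedy = re.sub(r"<.*?>", "", line)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'
```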

You also need to be careful: HTML is a crafty beast and can thwart your regexes.

Parse the HTML using BeautifulSoup, then only retrieve the text.

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend lxml instead: http://lxml.de/
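To make that concrete, here is a small sketch of two inputs where the non-greedy pattern goes wrong. The function name strip_tags_naive and both sample strings are made up for illustration; the second sample uses an unescaped "<", which is sloppy HTML but does turn up in real pages.

```python
import re

def strip_tags_naive(html):
    """Remove anything that looks like <...> (regex-only, hypothetical helper)."""
    return re.sub(r"<.*?>", "", html)

# A '>' inside an attribute value ends the match too early,
# so part of the attribute leaks into the output.
attr_case = '<a title="x > y">link</a>'

# A '<' ... '>' pair in ordinary text is mistaken for a tag,
# so real text is deleted.
text_case = "<p>if a < b and b > c</p>"

print(repr(strip_tags_naive(attr_case)))  # ' y">link'
print(repr(strip_tags_naive(text_case)))  # 'if a  c'
```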

Use a parser, either lxml or BeautifulSoup:

import lxml.html

# mystring holds the HTML source as a string
print(lxml.html.fromstring(mystring).text_content())
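If installing a third-party package is not an option, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answers above propose, and TextExtractor and extract_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(extract_text("<h1>Hello World!</h1>"))  # Hello World!
```

Note that this sketch also returns the contents of script and style elements as text, so real pages would likely need extra filtering.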
