Processing an HTML file using Python

Developer https://www.devze.com 2023-04-11 19:44 Source: the web
I want to remove all the tags in an HTML file, and I used Python's re module for that. For example, consider the line <h1>Hello World!</h1>. I want to retain only "Hello World!". To remove the tags, I used re.sub('<.*>','',string). The result is an empty string, because the regexp matches from the first angle bracket to the last one and removes everything in between. How can I get around this issue?


You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a crafty beast and can thwart your regexes.
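A minimal sketch of the difference between the greedy and non-greedy patterns, using the sample line from the question. The variable names are only illustrative.

```python
import re

html = "<h1>Hello World!</h1>"

# Greedy: '.*' runs from the first '<' to the last '>', so the whole line is removed.
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: '.*?' stops at the nearest '>', so each tag is removed on its own.
non_greedy = re.sub(r"<.*?>", "", html)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'
```

This works for simple markup like the example, but it is still only pattern matching and does not understand HTML structure.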


Parse the HTML using BeautifulSoup, then only retrieve the text.
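As a rough sketch of that suggestion, assuming the third-party beautifulsoup4 package is installed and using the sample line from the question (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<h1>Hello World!</h1>"

# Parse with Python's built-in "html.parser" backend, then keep only the text nodes.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print(text)  # Hello World!
```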


Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the approach that uses regular expressions is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
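To illustrate one such case, here is a small sketch comparing the non-greedy regex with a real parser on an attribute value that contains '>'. It uses the standard library's html.parser rather than lxml so that it runs without extra packages; TextExtractor and strip_tags are hypothetical helper names, not part of any library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


tricky = '<a title="x > y">link</a>'

# The regex treats the '>' inside the attribute value as the end of the tag.
regex_result = re.sub(r"<.*?>", "", tricky)

# The parser knows the '>' is inside a quoted attribute value.
parser_result = strip_tags(tricky)

print(repr(regex_result))   # ' y">link'
print(repr(parser_result))  # 'link'
```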


Use a parser, either lxml or BeautifulSoup:

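import lxml.html
mystring = "<h1>Hello World!</h1>"  # sample input; substitute your own HTML
# text_content() returns the text of the element and its children, without tags
print(lxml.html.fromstring(mystring).text_content())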

Related questions:

Using regular expressions to parse HTML: why not?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms


Beautiful Soup is great for parsing HTML!

You might not need it now, but it's worth learning, and it will help you in the future too.
