开发者

Python: Separating an HTML snippets to paragraphs

开发者 https://www.devze.com 2022-12-20 04:01 出处:网络
I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:

'''
<p开发者_StackOverflow中文版 class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

Should become:

['<p class="my_class">Hello!</p>',
 '<p>What's up?</p>'
 '<p style="whatever: whatever;">Goodbye!</p>']

What would be a good way to approach this?


If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.

Usage goes like:

from BeautifulSoup import BeautifulSoup 

soup = BeautifulSoup(some_html)

paragraphs = list(unicode(x) for x in soup.findAll('p'))


Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as the people who are recommending BeautifulSoup, except lxml is still being actively developed and BeatifulSoup development has slowed.


Use BeautifulSoup to parse the HTML and iterate over the paragraphs.


The xml.etree (std lib) or lxml.etree (enhanced) make this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号