I'm working on a web parser using urllib. I need to save only the lines that lie within a certain div tag. For instance, I'm saving all text in the div "body." This means all text within that div's tags will be returned. It's fine if there are other divs nested inside it, but as soon as I hit the closing tag of the parent div, it should stop. Any ideas?
My idea:
Search for the div you're looking for.
Record the position.
Keep track of any divs after that point: +1 for a new div, -1 for an end div.
When the count is back to 0, you're at the end of your parent div. Save that location.
Then save the data from the beginning position to the end position.
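The counting idea above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a robust HTML parser: it assumes the target div is identified by an `id` attribute (the question only says the div is called "body"), that the markup is reasonably well formed, and that no div-like text appears inside comments or scripts. The names `extract_div`, `DIV_TAG`, and `sample` are made up for illustration.

```python
import re

# Matches any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def extract_div(html, div_id):
    """Return the inner HTML of the first <div id="div_id">, or None."""
    # Step 1: find the opening tag of the target div and record where it ends.
    # Assumption: the div is selected by its id attribute.
    opening = re.search(
        r"<div\b[^>]*\bid\s*=\s*[\"']%s[\"'][^>]*>" % re.escape(div_id),
        html,
        re.IGNORECASE,
    )
    if opening is None:
        return None
    start = opening.end()

    # Step 2: walk the later div tags, +1 for each opening, -1 for each closing.
    # The count starts at 1 because the target div itself is already open.
    depth = 1
    for tag in DIV_TAG.finditer(html, start):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            # Step 3: this is the target div's own closing tag.
            return html[start:tag.start()]
    return None  # the target div was never closed


sample = (
    '<div id="header">skip</div>'
    '<div id="body">AAAA<div>BBBB<div>CCCC</div>DDDD</div>EEEE</div>'
    '<div id="footer">FFFF</div>'
)
print(extract_div(sample, "body"))
# AAAA<div>BBBB<div>CCCC</div>DDDD</div>EEEE
```

Pages fetched from the wild are often not this tidy, so a hand-rolled counter like this is likely to break on unclosed or malformed tags.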
If you're not really excited at the idea of parsing the HTML code yourself, there are two good options:
Beautiful Soup
Lxml
You'll probably find that lxml runs faster than BeautifulSoup, but in my uses, Beautiful Soup was very easy to learn and use, and handled typical crappy HTML as found in the wild well enough that I don't have need for anything else.
YMMV.
Using lxml:
import lxml.html as lh
content='''\
<body>
<div>AAAA
<div>BBBB
<div>CCCC
</div>DDDD
</div>EEEE
</div>FFFF
</body>
'''
doc=lh.document_fromstring(content)
div=doc.xpath('./body/div')[0]
print(div.text_content())
# AAAA
# BBBB
# CCCC
# DDDD
# EEEE
div=doc.xpath('./body/div/div')[0]
print(div.text_content())
# BBBB
# CCCC
# DDDD
Personally I prefer lxml in general, but there are times when its HTML handling is a bit off... Here's a BeautifulSoup recipe if it helps.
# Requires the beautifulsoup4 package (imported as bs4).
from bs4 import BeautifulSoup, NavigableString

def printText(tags):
    # Recursively join every text node found beneath the given tag.
    s = []
    for tag in tags:
        if isinstance(tag, NavigableString):
            s.append(tag)
        else:
            s.append(printText(tag))
    return "".join(s)

html = "<html><p>Para 1<div class='stuff'>Div Lead<p>Para 2<blockquote>Quote 1</div><blockquote>Quote 2"
soup = BeautifulSoup(html, "html.parser")
v = soup.find('div', attrs={'class': 'stuff'})
print(printText(v))