Selecting only text within a div tag

Developer https://www.devze.com 2023-01-21 06:28 Source: Internet

Related topics: python urllib

I'm working on a web parser using urllib. I need to save only the lines that lie within a certain div tag. For instance, say I'm saving all text in the div "body." This means all text within the div tags will be returned. It also means that if there are other divs inside of it, that's fine, but as soon as I hit the parent's closing tag, it stops. Any ideas?

My Idea

Search for the div you're looking for.
Record its position.
Keep track of any divs after that: +1 for each new div, -1 for each closing div.
When the count is back to 0, you're at the end of your parent div. Save that location.
Then save the data from the beginning position to the end position.
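
The counting idea above can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: it swaps raw string positions for the events emitted by html.parser.HTMLParser, and it assumes the target div is identified by an id attribute (the question does not say whether "body" is an id or a class). The names DivTextExtractor, div_text, and sample are hypothetical.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text inside the first <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id  # assumption: the div is found by its id
        self.depth = 0              # 0 means we are outside the target div
        self.done = False           # True once the target div has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1         # nested div: +1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1          # found the div we are looking for

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1         # closing div: -1
            if self.depth == 0:
                self.done = True    # back at the parent: stop collecting

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the concatenated text of the matching div, or '' if none."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = (
    '<div id="header">skip me</div>'
    '<div id="body">AAAA<div>BBBB<div>CCCC</div>DDDD</div>EEEE</div>'
    '<div id="footer">FFFF</div>'
)
print(div_text(sample, "body"))  # AAAABBBBCCCCDDDDEEEE
```

The counter only works when every div is closed properly. On the kind of broken HTML found in the wild, a missing closing tag would throw the count off, which is one reason to prefer a forgiving parser library.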

If you're not really excited at the idea of parsing the HTML code yourself, there are two good options:

Beautiful Soup

Lxml

You'll probably find that lxml runs faster than BeautifulSoup, but in my uses, Beautiful Soup was very easy to learn and use, and handled typical crappy HTML as found in the wild well enough that I don't have need for anything else.

YMMV.

Using lxml:

import lxml.html as lh
content='''\
<body>
<div>AAAA
  <div>BBBB
     <div>CCCC
     </div>DDDD
  </div>EEEE
</div>FFFF
</body>
'''
doc=lh.document_fromstring(content)
div=doc.xpath('./body/div')[0]
print(div.text_content())
# AAAA
#   BBBB
#      CCCC
#      DDDD
#   EEEE

div=doc.xpath('./body/div/div')[0]
print(div.text_content())
# BBBB
#      CCCC
#      DDDD

Personally I prefer lxml in general, but there are times when its HTML handling is a bit off... Here's a BeautifulSoup recipe if it helps.

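# Assumes the bs4 package (Beautiful Soup 4); the old BeautifulSoup 3 module is Python 2 only.
from bs4 import BeautifulSoup, NavigableString

def printText(tags):
    """Recursively join every text node found under the given tag."""
    s = []
    for tag in tags:
        if isinstance(tag, NavigableString):
            # A plain text node: keep it.
            s.append(tag)
        else:
            # A nested tag: collect the text of its children.
            s.append(printText(tag))
    return "".join(s)

# Deliberately malformed HTML: unclosed <p> and <blockquote> tags.
html = "<html><p>Para 1<div class='stuff'>Div Lead<p>Para 2<blockquote>Quote 1</div><blockquote>Quote 2"
soup = BeautifulSoup(html, "html.parser")

v = soup.find('div', attrs={'class': 'stuff'})

# text_content is an lxml method, not a Beautiful Soup one, so call the helper instead.
print(printText(v))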