Extract content within a tag with BeautifulSoup

开发者 (Developer) https://www.devze.com 2023-03-06 07:37 Source: web
I'd like to extract the content "Hello world". Please note that there are multiple <table> elements and similar <td colspan="2"> cells on the page as well:

<table border="0" cellspacing="2" width="800">
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
...

I tried the following:

hello = soup.find(text='Name: ')
hello.findPreviousSiblings

But it returned nothing.

In addition, I'm also having a problem extracting "My home address" from the following:

<td><b>Address:</b></td>

<td>My home address</td>

I'm using the same method to search for text="Address: ", but how do I navigate to the next <td> and extract its content?


The contents attribute works well for extracting text from <tag>text</tag>. Note that it returns a list of the tag's children, not a single string.


<td>My home address</td> example:

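from bs4 import BeautifulSoup

s = '<td>My home address</td>'
soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
td = soup.find('td')  # <td>My home address</td>
td.contents  # ['My home address'] (a list of child nodes)
td.contents[0]  # 'My home address'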

<td><b>Address:</b></td> example:

from bs4 import BeautifulSoup

s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s, 'html.parser')
b = soup.find('td').find('b')  # <b>Address:</b>
b.contents  # ['Address:'] (a list of child nodes)
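
The two snippets above read a cell once you have it, but the question also asks how to get from the Address: label to the cell that follows it. A minimal sketch of one way to do that, assuming bs4 and assuming both cells sit in the same <tr> (the wrapping <table> and <tr> here are made up for the example):

```python
import re

from bs4 import BeautifulSoup

# Sample markup modeled on the question; the <table>/<tr> wrapper is assumed.
html = """
<table>
  <tr>
    <td><b>Address:</b></td>
    <td>My home address</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <b> whose text is "Address:", tolerating surrounding whitespace.
label = soup.find("b", string=re.compile(r"^\s*Address:\s*$"))

# Climb to the enclosing <td>, then step to the next <td> in the same row.
# find_next_sibling("td") skips the whitespace text node between the cells.
value_cell = label.find_parent("td").find_next_sibling("td")

address = value_cell.get_text(strip=True)
print(address)
```

Filtering the sibling search by tag name matters here: the plain next_sibling of the first <td> is the newline between the two cells, not the second <td>.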


Use next instead (in bs4 the attribute is named next_element):

>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'

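next and previous let you move through the document's elements in the order the parser processed them, while the sibling methods work on the parse tree and only see nodes that share the same parent. Here "Name: " is the only child of <b>, so it has no siblings, which is why the sibling search found nothing.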
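
For anyone on bs4 and Python 3, here is a small sketch of the same idea using next_element, next to the tree-based route through the parent. The markup is the question's row, closed off so that it parses cleanly; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '<table><tr><td colspan="2"><b>Name: </b>Hello world</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# The text node that lives inside <b>.
name_text = soup.find(string="Name: ")

# It is the only child of <b>, so both sibling searches come back empty.
previous_siblings = name_text.find_previous_siblings()
next_siblings = name_text.find_next_siblings()

# Document order: the node parsed right after "Name: " is "Hello world".
via_document_order = name_text.next_element

# Parse tree: go up to <b>, then take the sibling that follows it in <td>.
via_tree = name_text.parent.next_sibling

print(via_document_order)
print(via_tree)
```

Both routes land on the same text node for this markup. They would differ if <b> were followed by another tag, since next_element would then return that tag rather than a string.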


Use the code below to extract the text content of an HTML tag with Python and BeautifulSoup:

from bs4 import BeautifulSoup

s = '<td>Example information</td>'  # your raw HTML
soup = BeautifulSoup(s, 'html.parser')  # parse the HTML with BeautifulSoup
td = soup.find('td')  # tag of interest: <td>Example information</td>
td.text  # 'Example information' (text with the markup stripped)
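
One caveat for the cell in the question: .text on <td colspan="2"><b>Name: </b>Hello world</td> also includes the label, and the page has several such cells. A rough sketch of picking the right cell and dropping the label follows. The helper value_for_label and the extra "Title:" table are invented for illustration, and the code assumes the <b> label is the first thing in its cell.

```python
from bs4 import BeautifulSoup

# Two tables with similar cells; the first one is made-up filler.
html = """
<table><tr><td colspan="2"><b>Title: </b>Something else</td></tr></table>
<table border="0" cellspacing="2" width="800">
  <tr><td colspan="2"><b>Name: </b>Hello world</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for_label(doc, label):
    """Return the text after <b>label</b> in a <td colspan="2">, or None."""
    for td in doc.find_all("td", attrs={"colspan": "2"}):
        b = td.find("b")
        if b is not None and b.get_text(strip=True) == label:
            # td.get_text() is label plus value; slice the label off.
            # This assumes <b> is the first child of the cell.
            return td.get_text()[len(b.get_text()):].strip()
    return None


print(value_for_label(soup, "Name:"))
```

Matching on the label text rather than on table position keeps the lookup working when tables are added to or removed from the page.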


from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag) -> str:
    # Inner HTML of a tag: child tags are serialized, text nodes are kept as-is.
    return ''.join(i.decode() if isinstance(i, Tag) else str(i) for i in tag.contents)
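
A possible usage of a helper like the one above, shown next to bs4's own Tag.decode_contents(), which gives the same result for this simple input. The helper is repeated so that the snippet runs on its own; the sample markup comes from the question.

```python
from bs4 import BeautifulSoup, Tag


def get_tag_html(tag: Tag) -> str:
    # Serialize child tags and keep text nodes as plain strings.
    return "".join(
        child.decode() if isinstance(child, Tag) else str(child)
        for child in tag.contents
    )


soup = BeautifulSoup(
    '<td colspan="2"><b>Name: </b>Hello world</td>', "html.parser"
)
td = soup.find("td")

inner = get_tag_html(td)        # inner HTML built by hand
builtin = td.decode_contents()  # bs4's built-in inner-HTML serializer

print(inner)
print(builtin)
```

The two can differ on text containing characters such as & or <, because decode_contents() applies an output formatter to strings while str() on a text node does not.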