开发者

Beautiful soup question

开发者 https://www.devze.com 2023-02-03 10:22 Source: Internet

I want to fetch specific rows in an HTML document.

The rows have the following attributes set: bgcolor and valign.

Here is a snippet of the HTML table:

<table>
   <tbody>
      <tr bgcolor="#f01234" valign="top">
        <!--- td's follow ... -->
      </tr>
      <tr bgcolor="#c01234" valign="top">
        <!--- td's follow ... -->
      </tr>
   </tbody>
</table>

I've had a very quick look at BS's documentation. It's not clear what params to pass to findAll to match the rows I want.

Does anyone know what to pass to findAll() to match the rows I want?


Don't use regex to parse HTML. Use an HTML parser:

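import lxml.html

# your_html holds the HTML source as a string
doc = lxml.html.fromstring(your_html)
# select rows whose bgcolor is one of the two values and whose valign is "top"
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
    "and @valign='top']")
print(result)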

That will extract all tr elements in your HTML that match. You can then do further operations on them, such as changing text or attribute values, extracting content, or searching further.
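To illustrate those follow-up operations, here is a minimal sketch. It assumes lxml is installed; the sample markup, the td contents, and the variable names are made up for illustration and are not part of the original answer.

```python
import lxml.html  # assumes the lxml package is installed

# Sample markup modelled on the question; the <td> contents are invented.
html = """
<table>
  <tbody>
    <tr bgcolor="#f01234" valign="top"><td>first</td></tr>
    <tr bgcolor="#c01234" valign="top"><td>second</td></tr>
  </tbody>
</table>
"""

doc = lxml.html.fromstring(html)
rows = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                 "and @valign='top']")

for row in rows:
    cells = row.xpath("./td")               # search further inside the row
    row.set("bgcolor", "#ffffff")           # change an attribute value
    cells[0].text = cells[0].text.upper()   # change the text of the first cell

new_colors = [row.get("bgcolor") for row in rows]
first_cells = [row.xpath("./td")[0].text for row in rows]

# Serialize one modified row back to markup.
print(lxml.html.tostring(rows[0], encoding="unicode"))
```

Each matched row is an ordinary lxml element, so get(), set(), xpath() and tostring() all work on it directly.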

Obligatory link:

RegEx match open tags except XHTML self-contained tags


Something like:

import re

soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'), 'valign': 'top'})
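For a complete picture, here is a small runnable sketch of the same idea. It assumes Beautiful Soup 4 (the bs4 package), where find_all is the current spelling of findAll. The sample rows, their td contents, and the anchored pattern are illustrative choices, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Sample markup modelled on the question; the <td> contents and the
# third (non-matching) row are invented for illustration.
html = """
<table>
  <tbody>
    <tr bgcolor="#f01234" valign="top"><td>first</td></tr>
    <tr bgcolor="#c01234" valign="top"><td>second</td></tr>
    <tr bgcolor="#ffffff" valign="top"><td>other</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor the pattern so that only the two exact colour values match;
# an unanchored pattern would also match longer strings containing them.
wanted_color = re.compile(r"^#(?:f01234|c01234)$")

# A row must satisfy every entry in attrs to be returned.
rows = soup.find_all("tr", attrs={"bgcolor": wanted_color, "valign": "top"})

row_texts = [row.get_text(strip=True) for row in rows]
print(row_texts)  # ['first', 'second']
```

If only a handful of exact values are needed, passing a list such as ['#f01234', '#c01234'] instead of a compiled pattern should also work and avoids the regex entirely.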
