I want to fetch specific rows in an HTML document.
The rows have the following attributes set: bgcolor and valign.
Here is a snippet of the HTML table:
<table>
<tbody>
<tr bgcolor="#f01234" valign="top">
<!--- td's follow ... -->
</tr>
<tr bgcolor="#c01234" valign="top">
<!--- td's follow ... -->
</tr>
</tbody>
</table>
I've had a very quick look at BeautifulSoup's documentation. It's not clear what parameters to pass to findAll() to match the rows I want.
Does anyone know what to pass to findAll() to match these rows?
Don't use regex to parse HTML. Use an HTML parser:
import lxml.html
doc = lxml.html.fromstring(your_html)
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
"and @valign='top']")
print(result)
That will extract all matching tr elements from your HTML. You can then work with them further: change their text or attribute values, extract them, or search within them.
Obligatory link:
RegEx match open tags except XHTML self-contained tags
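For readers who cannot install lxml, the same "use a parser, not a regex" idea can be sketched with Python's standard-library html.parser. This is not the method from the answer above, only a dependency-free counterpart; the names RowCollector, WANTED_COLORS, and sample_html are hypothetical, and the third row in the sample is added just to show a non-match.

```python
from html.parser import HTMLParser

# The two bgcolor values from the question.
WANTED_COLORS = {"#f01234", "#c01234"}


class RowCollector(HTMLParser):
    """Collect the attributes of every <tr> matching the question's criteria."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values stay as written.
        attributes = dict(attrs)
        if (tag == "tr"
                and attributes.get("bgcolor") in WANTED_COLORS
                and attributes.get("valign") == "top"):
            self.rows.append(attributes)


# Hypothetical sample input; the last row should not match.
sample_html = """
<table>
<tbody>
<tr bgcolor="#f01234" valign="top"><td>a</td></tr>
<tr bgcolor="#c01234" valign="top"><td>b</td></tr>
<tr bgcolor="#ffffff" valign="top"><td>c</td></tr>
</tbody>
</table>
"""

collector = RowCollector()
collector.feed(sample_html)
matched_colors = [row["bgcolor"] for row in collector.rows]
print(matched_colors)
```

Unlike the lxml version, this only records the start tags' attributes; it does not build a tree, so editing or searching inside the matched rows would need extra bookkeeping.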
Something like:
import re
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'), 'valign': 'top'})
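A fuller sketch of the same idea, assuming the newer bs4 package (where findAll is also available as find_all) and Python's built-in "html.parser" backend. The sample_html string and the variable names are made up for illustration, and the regex is anchored here so that it only matches the two exact colour values.

```python
import re

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical sample input; the last row should not match.
sample_html = """
<table>
<tbody>
<tr bgcolor="#f01234" valign="top"><td>a</td></tr>
<tr bgcolor="#c01234" valign="top"><td>b</td></tr>
<tr bgcolor="#ffffff" valign="top"><td>c</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Option 1: a compiled regex for bgcolor, a plain string for valign.
color_pattern = re.compile(r"^#(?:f01234|c01234)$")
rows = soup.find_all("tr", attrs={"bgcolor": color_pattern, "valign": "top"})
row_colors = [row["bgcolor"] for row in rows]
print(row_colors)

# Option 2: a list of accepted values; a row matches if bgcolor equals any of them.
rows_from_list = soup.find_all("tr", bgcolor=["#f01234", "#c01234"], valign="top")
list_colors = [row["bgcolor"] for row in rows_from_list]
print(list_colors)
```

The list form avoids the regex entirely when the accepted values are known in advance; the regex form is more useful when the values follow a pattern.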