Say we want the first table on a page:
table = BeautifulSoup(...).table
The rows can be scanned with a clean for loop:
for row in table:
    f(row)
But extracting a single column gets messy.
My question: is there an elegant way to extract a single column, either by its position or by its 'name' (i.e. the text that appears in the first row of that column)?
lxml is many times faster than BeautifulSoup, so you might want to use that.
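from lxml.html import parse

# Fetch and parse the page; getroot() returns the root <html> element.
doc = parse('http://python.org').getroot()
for row in doc.cssselect('table > tr'):
    # nth-child is 1-based, so this selects the third cell of each row.
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())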
Or, instead of looping:
rows = doc.cssselect('table > tr')
# cssselect() is a method of each element, not of the list, so query row by row.
cells = [cell.text_content() for row in rows for cell in row.cssselect('td:nth-child(3)')]
print(cells)
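The snippets above pick a column by position only. For the "by name" half of the question, here is a minimal sketch that assumes the table has already been reduced to a list of rows, each a list of cell strings. The helper name extract_column and the sample data are made up for illustration; they are not part of lxml or BeautifulSoup.

```python
def extract_column(rows, key):
    """Return one column from a table given as a list of rows of cell text.

    key is either a 0-based position (int) or the header text found in
    the first row. When looking up by name, the header row itself is
    left out of the result.
    """
    if not rows:
        return []
    if isinstance(key, int):
        index, body = key, rows
    else:
        header = [cell.strip() for cell in rows[0]]
        # list.index raises ValueError if no header cell matches.
        index, body = header.index(key), rows[1:]
    # Skip rows too short to have this column (e.g. rows using colspan).
    return [row[index] for row in body if index < len(row)]


# Hypothetical sample table, standing in for text scraped from a page.
rows = [
    ["Name", "Qty", "Price"],
    ["apple", "3", "1.20"],
    ["pear", "5", "0.80"],
]
print(extract_column(rows, "Qty"))  # ['3', '5']
print(extract_column(rows, 0))      # ['Name', 'apple', 'pear']
```

With lxml, the rows for the first table in the document could be built with something like [[c.text_content().strip() for c in tr.xpath('./th | ./td')] for tr in doc.xpath('(//table)[1]//tr')]. Note that this positional approach does not account for cells spanning several columns, and nested tables would contribute their rows too.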