I'm scraping a page with Python's pyquery, and I'm kinda confused by the types it returns, and in particular how to iterate over a list of results.
If my HTML looks a bit like this:
<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>
How do I get the inside of the <h3> tags, one by one, so I can process them? I'm trying:
results_page = pq(response.read())
formwraps = results_page(".formwrap")
print type(formwraps)
print type([formwraps])
for my_div in [formwraps]:
    print type(my_div)
    print my_div("h3").text()
This produces:
<class 'pyquery.pyquery.PyQuery'>
<type 'list'>
<class 'pyquery.pyquery.PyQuery'>
Something interesting something else interesting
It looks like there's no actual iteration going on. How can I pull out each element individually?
Extra question from a newbie: what are the square brackets around [a] doing? It looks like it converts a special Pyquery object to a list. Is [] a standard Python operator?
------UPDATE--------
I've found an 'each' function in the pyquery docs. However, I don't understand how to use it for what I want. Say I just want to print out the content of the <h3>. This produces a syntax error: why?
formwraps.each(lambda e: print e("h3").text())
Since pyquery 1.2.3 (commit), you can call items() on a PyQuery object to go through each matched element as its own PyQuery object:
print(type(formwraps.items()))
for my_div in formwraps.items():
    print(my_div("h3").text())
The method items() returns a generator, and this will work on both Python 2 and 3.
I think you can do something like this:
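from pyquery import PyQuery as pq
def get_h3_contents(index, node):
    # Wrap the node handed over by each() so it can be queried
    d = pq(node)
    # Print the text; the original computed it and then discarded it
    print(d.find('h3').text())
formwraps.each(get_h3_contents)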
Hope that helps someone if not the original poster.
I've never used pyquery, but the source of the syntax error is that lambdas in Python are limited: the body can only be a single expression, so statements such as print (which is a statement in Python 2) are not allowed. You can work around this limitation with a function, e.g.:
def my_print(x):
    print x
formwraps.each(lambda e: my_print(e("h3").text()))
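To illustrate the single-expression rule on its own, here is a minimal sketch that does not touch pyquery at all. It assumes Python 3, where print is a function, and apply_to_each is a hypothetical stand-in for an each()-style helper, not part of any library:

```python
def apply_to_each(values, func):
    """Hypothetical each()-style helper: call func on every value."""
    for value in values:
        func(value)


collected = []

# A lambda body must be a single expression; a method call qualifies.
apply_to_each(
    ["Something interesting", "Something else interesting"],
    lambda text: collected.append(text.upper()),
)

# In Python 3, print(...) is an ordinary function call, so it is also
# allowed inside a lambda. In Python 2 this line would be a syntax error.
apply_to_each(collected, lambda text: print(text))
```

So on Python 3 the wrapper function is not needed just to print; on Python 2 it still is.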
Recent pyquery versions allow you to use .items():
[h.text() for h in formwraps('h3').items()]
I think you could iterate over a PyQuery object like this:
for i in range(len(formwraps)):
    print(formwraps.eq(i))
...
You can also do it without the each method:
from pyquery import PyQuery as pq
html = """
<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>
"""
formwraps = pq(html)(".formwrap")
for my_div in formwraps:
    print(pq(my_div)("h3").text())
It produces the following output:
Something interesting
Something else interesting