Iterating over objects in pyquery

Developer https://www.devze.com 2023-01-07 03:26 Source: Internet
I'm scraping a page with Python's pyquery, and I'm kinda confused by the types it returns, and in particular how to iterate over a list of results.

If my HTML looks a bit like this:

<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>

How do I get the inside of the <h3> tags, one by one so I can process them? I'm trying:

results_page = pq(response.read())
formwraps = results_page(".formwrap") 
print type(formwraps)
print type([formwraps])
for my_div in [formwraps]:
    print type(my_div)
    print my_div("h3").text() 

This produces:

<class 'pyquery.pyquery.PyQuery'>
<type 'list'>
<class 'pyquery.pyquery.PyQuery'>
Something interesting something else interesting

It looks like there's no actual iteration going on. How can I pull out each element individually?

Extra question from a newbie: what are the square brackets around [a] doing? It looks like it converts a special Pyquery object to a list. Is [] a standard Python operator?

------UPDATE--------

I've found an 'each' function in the pyquery docs. However, I don't understand how to use it for what I want. Say I just want to print out the content of the <h3>. This produces a syntax error: why?

formwraps.each(lambda e: print e("h3").text())


Since pyquery 1.2.3, you can call items() on a PyQuery object to iterate over the matched elements, each wrapped as its own PyQuery object:

print(type(formwraps.items()))
for my_div in formwraps.items():
    print(my_div("h3").text())

The method items() returns a generator and this will work on both Python 2 and 3.


I think you can do something like this:

from pyquery import PyQuery as pq

def get_h3_contents(index, node):
    # each() passes the index and the raw element; wrap it to query it
    d = pq(node)
    print(d.find('h3').text())

formwraps.each(get_h3_contents)

Hope that helps someone if not the original poster.


I've never used pyquery, but the syntax error comes from a limitation of Python lambdas: the body must be a single expression, so statements (such as Python 2's print) are not allowed. You can work around this limitation by wrapping the statement in a function, e.g.:

def my_print(x):
    print(x)

# each() passes the index first, then the raw element, so wrap it with pq()
formwraps.each(lambda i, e: my_print(pq(e)("h3").text()))
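To make the expression-versus-statement distinction concrete without involving pyquery, here is a small sketch that asks Python itself whether a snippet compiles. It assumes Python 3, where print is an ordinary function; the `compiles` helper is hypothetical and exists only for this illustration.

```python
def compiles(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# A function call is an expression, so it is allowed in a lambda body.
# In Python 3, print(...) is such a call.
call_ok = compiles("f = lambda e: print(e)")

# 'return' is a statement, so it cannot appear in a lambda body.
statement_ok = compiles("f = lambda e: return e")

print(call_ok, statement_ok)
```

On Python 2, print was a statement, which is why the lambda in the question was rejected there.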


Recent pyquery versions allow you to use .items():

[h.text() for h in formwraps('h3').items()]


I think you could iterate over a PyQuery object by index like this:

for i in range(len(formwraps)):
    print(formwraps.eq(i))
    ...


You can also do it without the each method:

from pyquery import PyQuery as pq
html = """
<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>
"""
formwraps = pq(html)(".formwrap")

for my_div in formwraps:
    print(pq(my_div)("h3").text())

It produces the following output:

Something interesting
Something else interesting