I've run into a problem with mechanize following links. Here's a snippet of what I'm aiming to do:
for link in mech.links(url_regex='/test/'):
    mech.follow_link(link)
    # Do some processing on that link
    mech.back()
According to the mechanize examples, this should work just fine. However, it doesn't: despite calling .back(), the loop ends even though there are more links to visit. If I comment out mech.follow_link(link) and mech.back() and replace them with print link.text, it prints all 50 or so links. But as soon as I uncomment mech.follow_link, the loop terminates after the first follow_link. back() itself is working, in that if I print mech.title(), then call mech.back() and print mech.title() again, it clearly shows the first title and then the previous page's title. I'm really confused, and this is how it's done in the docs. Not sure what's going on.
Pirate, I agree, this shouldn't be happening; you're doing pretty much exactly what the documentation page at wwwsearch.sourceforge.net/mechanize/ says to do. I tried code similar to yours and got the same result: it stopped after the first iteration.
I did, however, find a work-around, namely to save the link URLs from links() into a list, and then follow each URL from that list:
from mechanize import Browser

br = Browser()
linklist = []
br.open(your_page_here)
for link in br.links(url_regex='/test/'):
    linklist.append(link.url)

for url in linklist:
    br.open(url)
    print br.title()
It's ugly and you shouldn't have to do it, but it seems to work.
I'm not really thrilled with mechanize because of bugginess like this (and a problem I had with mechanize handling two submit buttons poorly). On the other hand, it's very simple to install, seems pretty portable, and can easily run offline (via simple cron jobs) compared to other testing frameworks like Selenium (seleniumhq dot org), which looks great but seems a lot more involved to actually set up and use.
A much more straightforward workaround than saving the link list is to simply get a second Browser object. This can be considered equivalent to opening a second tab in a "real" browser. If you also need authentication, you will need to share a cookie jar between browser instances:
import mechanize
import cookielib

br = mechanize.Browser()
br2 = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br2.set_cookiejar(cj)

br.open("http://yoursite.com/login")
br.select_form(nr=0)
br["username"] = "..."  # The hash keys are the names of the form fields
br["password"] = "..."
br.submit()  # This will save the authentication cookie to the shared cookie jar!

br.open("http://yoursite.com/page-to-parse")
for link in br.links(url_regex="/link_text"):
    req = br.click_link(url=link.url)
    html = br2.open(req).read()
Note that it is necessary to get a request object from the first instance and then submit it with the second. This is the equivalent of the "Open in a new window/tab" command in a "real" browser.
Every page visit resets the links() iterator to the links on that new page. You therefore need to save it to a separate variable, e.g. links = mech.links(), or, as Chirael indicated, links = list(mech.links()), which has the advantage that it can be counted with print >>sys.stderr, '# links: %d' % len(links). This is not a bug in mechanize.Browser; it's just a side effect of having a stateful object.
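The underlying pattern is easy to reproduce without mechanize at all. Here is a self-contained sketch with a hypothetical Pager class standing in for the browser (none of these names come from mechanize); snapshotting the iterator with list() before mutating state is what keeps the loop alive:

    # Hypothetical stand-in for a stateful browser: following a "link"
    # replaces the current page, which invalidates in-flight iteration.
    class Pager:
        def __init__(self):
            self.page = ["a", "b", "c"]

        def links(self):
            # Like mechanize's links(), yields from the *current* page.
            return iter(self.page)

        def follow(self, link):
            # Visiting a link replaces the page contents in place.
            self.page[:] = [link.upper()]

    pager = Pager()
    visited = []
    for link in list(pager.links()):  # snapshot the links first
        pager.follow(link)
        visited.append(link)
    # visited is now ["a", "b", "c"]; iterating pager.links() directly
    # instead would stop after the first follow(), just like the
    # behavior described in the question.
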
Another gotcha I noticed while playing with this: you cannot use mech.back() if mech.request was not set from the beginning, as is the case when mech.set_response() was used to set the original page content. In that case you have to explicitly set the first request to something, e.g. mech.request = mechanize.Request('about://config'); otherwise you get BrowserStateError: already at start of history.
And for the sake of completeness, if someone is coming here from a Google search as I did: make sure to set the headers in mechanize.make_response to, at minimum, (('content-type', 'text/html'),), or mech.viewing_html will remain False and mech.links() will raise BrowserStateError("not viewing HTML").
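To make that concrete, here is a minimal sketch of feeding mechanize a canned response; the URL and HTML are placeholders, and make_response takes (data, headers, url, code, msg):

    import mechanize

    br = mechanize.Browser()
    html = "<html><body><a href='/test/1'>one</a></body></html>"
    # Without the content-type header below, viewing_html() stays
    # False and links() raises BrowserStateError("not viewing HTML").
    response = mechanize.make_response(
        html, [("content-type", "text/html")],
        "http://example.com/", 200, "OK")
    br.set_response(response)

After this, br.viewing_html() should be true and br.links() should yield the one link from the canned page.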