I'm trying to scrape an ASP.NET page where I need to page through a list of items in a GridView control. I've never used ASP.NET and have been searching the Net for pointers, but now I've hit a brick wall. The page links are of the form:
javascript:__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems','Page$2')
I'm currently trying to get this working using mechanize in Python. I initially tried the following, assuming that mechanize would handle the VIEWSTATE variables for me:
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
Using a network monitor (Fiddler2), I noticed that two more variables were populated, so I added these in too:
br.select_form(nr=0)
br.form.new_control('hidden','ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1',attrs = dict(name='ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'))
br.form.new_control('hidden','hiddenInputToUpdateATBuffer_CommonToolkitScripts',attrs = dict(name='hiddenInputToUpdateATBuffer_CommonToolkitScripts'))
br.form.new_control('hidden','__ASYNCPOST',attrs = dict(name='__ASYNCPOST'))
br.form.set_all_readonly(False)
br['hiddenInputToUpdateATBuffer_CommonToolkitScripts'] = '1'
br['__ASYNCPOST'] = 'TRUE'
br['ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$SearchResultsUpdatePanel|ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
With both of these, the HTML I get back is still for page 1 only.
I think there may be a couple of potential issues:
I'm not sure I'm doing the submit right. There are multiple submit buttons on the page so the one I'm searching for is the "search" button, which is what I previously used to get to the first page. I could see that being why the first page is displayed. If I use br.submit() without a name then it uses another submit control that takes you somewhere else.
When you click a page number in a browser, the gridview control updates without a page reload. As I'm not running JavaScript, maybe I can't get that, but I would at least expect to be able to get back the data from the POST and parse that.
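On parsing the data from the POST: as far as I understand it, an async UpdatePanel postback does not return a full HTML page but a pipe-delimited "delta" made of repeated `length|type|id|content|` records. Here is a minimal sketch of splitting such a response into records, assuming that format holds for this page. The function name `parse_delta` and the sample string are made up for illustration.

```python
def parse_delta(text):
    """Split an ASP.NET AJAX delta response into (type, id, content) tuples.

    Assumes the body is a sequence of `length|type|id|content|` records,
    where `length` is the number of characters in `content`.
    """
    parts = []
    pos = 0
    while pos < len(text):
        header = []
        # Read the three pipe-terminated header fields: length, type, id.
        for _ in range(3):
            bar = text.index('|', pos)
            header.append(text[pos:bar])
            pos = bar + 1
        length = int(header[0])
        # The content may itself contain '|', so slice by length instead of splitting.
        content = text[pos:pos + length]
        pos += length + 1  # skip the content and its trailing '|'
        parts.append((header[1], header[2], content))
    return parts


# Hypothetical sample, not captured from the real site.
sample = "5|updatePanel|p1|<b/>x|3|hiddenField|__VIEWSTATE|a|b|"
records = parse_delta(sample)
```

The `updatePanel` record would then hold the HTML fragment for the grid, which can be passed to whatever HTML parser is already in use.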
Any help would be much appreciated!
Managed to do it by building an XMLHttpRequest per the answer here:
Using Python and Mechanize to submit form data and authenticate
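For anyone landing here, this is a rough sketch of the idea, not the code from the linked answer: build the POST body by hand so that no submit button is included, and add the headers a browser sends for an async postback. It uses only the Python 3 standard library rather than mechanize. The function name, the example URL and the shortened control names are placeholders; the hidden fields (`__VIEWSTATE` and so on) are assumed to have been scraped from the page already.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_page_request(url, hidden_fields, script_manager, update_panel, grid, page):
    """Build (but do not send) a POST that mimics a GridView paging postback.

    hidden_fields: dict of hidden inputs scraped from the current page,
                   e.g. __VIEWSTATE and __EVENTVALIDATION.
    script_manager, update_panel, grid: full control names from the page.
    """
    fields = dict(hidden_fields)
    fields.update({
        script_manager: update_panel + '|' + grid,
        '__EVENTTARGET': grid,
        '__EVENTARGUMENT': 'Page$%d' % page,
        '__ASYNCPOST': 'true',
    })
    # No submit button name is added, so the server should treat this as a
    # paging event rather than a fresh search.
    body = urlencode(fields).encode('utf-8')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
        'X-MicrosoftAjax': 'Delta=true',
        'X-Requested-With': 'XMLHttpRequest',
    }
    return Request(url, data=body, headers=headers)


# Placeholder values for illustration only.
req = build_page_request(
    'http://example.com/items.aspx',
    {'__VIEWSTATE': 'abc'},
    'ScriptManager1', 'SearchResultsUpdatePanel', 'gridViewItems', 2,
)
```

The request can then be opened with `urllib.request.urlopen(req)` or an opener that carries the session cookies. The hidden fields need to be refreshed from each response before asking for the next page.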