Get HTML source, including the result of JavaScript and authentication

Developer (https://www.devze.com), 2023-03-02 19:55. Source: the web
I am building a web scraper and need the HTML source of a page as it actually appears in the browser. However, I only get a limited HTML source, one that does not include the information I need. I think I am either seeing the page before its JavaScript has run, or I am not getting the full content because I don't have the right authentication. My result is the same as "View Source" in Chrome, when what I want is what Chrome's "Inspect Element" shows. My test case is cimber.dk, after entering flight information and searching.

I am coding in Python and first tried the urllib2 library. Then I heard that Selenium was good for this, so I tried that too. However, it also gives me the same limited page source.

This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk so I was starting with a 'clean slate')

import urllib   # Python 2: provides urlencode
import urllib2  # Python 2: openers, handlers, Request

url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY' : 'D',...} #one for each value
# Opener that stores cookies and sends them back on later requests.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
#Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same.
urllib2.install_opener(opener)
request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....) # one for each header, also the cookie one
p = urllib.urlencode(values)  # form-encode the POST body
data = opener.open(request, p).read()  # passing data makes this a POST
# data is now the limited source, like Chrome View Source

#I tried to add the following in some vain attempt to do a redirect.  
#The result is always  "HTTP Error 400: Bad request"

f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')  
data = f.read()  
f.close()
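
For anyone reading this on Python 3, where urllib2 no longer exists, the cookie-keeping POST above might look roughly like the sketch below. The helper names `make_session` and `build_post` are made up for illustration, the shortened User-Agent is a placeholder, and only the `ARRANGE_BY` value is filled in because the other form values are elided in the question. The network call is left commented out. This only reproduces the request; it does nothing about JavaScript.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar): an opener that stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Placeholder User-Agent; use the full browser string if the site needs it.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar


def build_post(url, values):
    """Build a form-encoded POST request; nothing is sent until it is opened."""
    body = urllib.parse.urlencode(values).encode("ascii")
    # Supplying data is what turns the request into a POST.
    return urllib.request.Request(url, data=body)


opener, jar = make_session()
# Only ARRANGE_BY comes from the question; the remaining fields are elided there.
request = build_post("https://www.cimber.dk/booking/", {"ARRANGE_BY": "D"})
# html = opener.open(request).read()  # network call, intentionally not run here
```

Reusing the same `opener` for every follow-up request keeps the cookies from earlier responses, which is the part that matters for the authentication side of the question.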


Most libraries like urllib2 do not support JavaScript: they return the HTML exactly as the server sent it and never run the page's scripts.
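
A small standard-library illustration of why the fetched source can lack the data. The page below is hypothetical (not the real cimber.dk markup): the results container is empty in the served HTML and would only be filled in if a browser ran the inline script. A plain HTML parser, like a plain HTTP client, never runs it. `ResultsText` is just a helper name chosen for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical page: the text only appears once a browser executes the script.
RAW_HTML = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").textContent = "EXAMPLE FLIGHT";
</script>
</body></html>
"""


class ResultsText(HTMLParser):
    """Collect the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only while outside the div.
        if self.inside:
            self.text.append(data)


parser = ResultsText()
parser.feed(RAW_HTML)
results_text = "".join(parser.text).strip()
print(repr(results_text))  # the div's text as served, before any script runs
```

The string the script would insert is present in the raw source only as script text, so the div itself comes back empty. That is the "View Source" versus "Inspect Element" difference described in the question.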

If you want the result of JavaScript, you will need to either automate an existing browser or browser engine, or use a large, monolithic library that is essentially an advanced web crawler.
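
A rough sketch of the browser-automation route with Selenium, which the question already mentions. One possible reason Selenium returned the same limited source is that the page source was read before the scripts had finished, so this version polls until a marker string shows up. That is an assumption about the cause, not something confirmed for cimber.dk. `fetch_rendered_html` and `scrape_with_firefox` are hypothetical helper names, the marker text is whatever you expect in the finished page, and the polling loop is hand-rolled so that it works with any object exposing `get` and `page_source`.

```python
import time


def fetch_rendered_html(driver, url, is_ready, timeout=15.0, poll=0.5):
    """Load url, then poll driver.page_source until is_ready(html) is true.

    Raises TimeoutError if the page never reaches the ready state.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # re-read the current page source each time
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not finish rendering in time")
        time.sleep(poll)


def scrape_with_firefox(url, marker):
    """Drive a real Firefox through Selenium (third-party, imported lazily)."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url, lambda html: marker in html)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Selenium also ships its own `WebDriverWait` helper in `selenium.webdriver.support.ui`, which would be the more usual choice in real code. If the results live inside a frame or appear only after a form submission, the driver would have to perform those steps first, which this sketch does not attempt.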
