I am using RCurl
in R to try and download data from a website, but I'm having trouble finding out what URL to use. He开发者_开发知识库re is the site:
http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
See how in the upper right, above the displayed sheet, there's a link to download the data as a .csv
file? I was wondering if there was a way to find a regular HTTP address for that .csv
file, because RCurl
can't handle the Javascript commands.
I will give you a quick and dirty way to get the data. Firstly you can use Fiddler2 http://www.fiddler2.com/fiddler2/ to inspect the POST that your browser sends. This results in the following POST:
POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
Host: www.invescopowershares.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
Content-Type: application/x-www-form-urlencoded
Content-Length: 70669
__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........
So we can see that 3 parameters are being posted namely __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE.
The required form for the postForm call would be:
postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
Now comes the quick and dirty bit. I would just open a browser and get the relevant parameters that it recieves as follows:
library(rcom)
ie = comCreateObject('InternetExplorer.Application')
ie[["visible"]]=T # true for debugging
ie$Navigate2("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
while(comGetProperty(ie,"busy")||comGetProperty(ie,"ReadyState")<4){
Sys.sleep(1)
print(comGetProperty(ie,"ReadyState"))
}
myDoc<-comGetProperty(ie,"Document")
myPW<-comGetProperty(myDoc,"parentWindow")
comInvoke(myPW,"execScript","var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;","JavaScript")
event.val<-myPW[["dumVar1"]]
view.state<-myPW[["dumVar2"]]
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
ie$Quit()
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')
> fin.data[1,]
ticker SecurityNum Name CouponRate maturitydate
1 PGX 949746879 WELLS FARGO & COMPANY PFD 0.08
rating Shares PercentageOfFund PositionDate
1 BBB+/Baa3 2538656 0.04442112 06/11/2012
__EVENTVALIDATION, __VIEWSTATE maybe be always the same or they maybe session cookies. You could probably get them using RCurl but as I say this is the quick and dirty solution and we just take the ones that Internet Explorer is given. Things to note:
1). This requires windows with IE installed to use the rcom bits.
2). If you are running ie9 you may need to add invescopowershares.com to the Compatibility View Settings (as microsoft seems to have blocked event.val<-myPW[["dumVar1"]] type com calls)
EDIT (UPDATE)
Having looked thru the website in more detail __EVENTVALIDATION, __VIEWSTATE are being set as javascript variables on the initial page. We can just parse these in a quick and dirty fashion as follows without resorting to calling a browser.
dum<-getURL("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
event.val<-unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
event.val<-unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
view.state<-unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
view.state<-unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')
The above should work cross platform.
Clicking on the Download link executes this piece of JavaScript:
__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')
That __doPostBack
function appears to simply fill in a couple of hidden form fields on that page then submit a POST request.
A quick googling shows that RCurl is capable of submitting a POST request. So, what you would need to do is look in the source of that page, find the form with name "aspnetForm", take all the fields from that form, and create your own POST request that submits the fields to the action URL (http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX).
Can't guarantee this will work, though. There appears to be a hidden form field named __VIEWSTATE
that appears to encode some information, and I don't know how this factors in.
This is definitely the way to get the .csv file in RCurl, but I can't figure out what form fields I want to use in getForm to make it work. Should I use the fields from the doPostBack command that's attached to the "Download" link on the page, or should I use the fields from aspnetForm on the source page. Just for reference, the aspnetForm field we're interested in is:
" form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm" style="margin:0px" "
... and the postForm request I just tried that didn't work was
postForm("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX", "form name" = "aspnetForm", "method" = "post", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "style" = "margin:0px")
Thanks for all of the help!
There is now a function in the qmao package that will do this for you. (It is based on code from a now deleted answer to this question.)
You can use the dlPowerShares function like this:
require("qmao")
Symbol <- "PGX"
dat <- qmao:::dlPowerShares(event.target = "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1",
action = paste0("holdings.aspx?ticker=", Symbol))
> head(dat)
ticker SecurityNum Name CouponRate maturitydate rating Shares PercentageOfFund PositionDate
1 PGX 173080201 CITIGROUP CAPITAL XIII 0.07875 10/30/2040 BB/Ba2 2998647 0.04274939 08/31/2012
2 PGX 949746879 WELLS FARGO & COMPANY PFD 0.08000 BBB+/Baa3 2549992 0.03935854 08/31/2012
3 PGX 06739H362 BARCLAYS BK PLC 0.08125 A-/Baa3 2757635 0.03644835 08/31/2012
4 PGX 46625H621 JPMORGAN CHASE 0.08625 BBB+/Baa1 2416021 0.03310707 08/31/2012
5 PGX 060505765 BANK OF AMERICA CORP PFD 8.2 0.08200 BB+/B1 2345508 0.03128002 08/31/2012
6 PGX 060505559 BANC OF AMERICA CORP PFD 8.625 0.08625 BB+/B1 2259484 0.03001599 08/31/2012
In the code above, event.target
is the first string inside of the javascript:__doPostBack()
function that you will get when you right click the “Download” link and “Copy Link Address”.
action
is the product-specific part of the action url.
Internally, the code follows Jeff's suggestion in his
answer and searches the source of the page for the values
of the fields for the "aspnetForm". It then uses those values in a call to postForm
(from the RCurl package.)
In the qmao package, dlPowerShares
is
used by getHoldings.powershares
. Also, getHoldings
will call getHoldings.powershares
if one of the Symbols
passed to it is the symbol of a PowerShares ETF.
p.s. if qmao:::dlPowerShares
is called with its defaults, it will download the PowerShares product list from
http://www.invescopowershares.com/products/
精彩评论