开发者

Scraping non html-websites with R?

开发者 https://www.devze.com 2023-04-06 05:30 出处:网络
Scraping data from html tables from html websites is cool and easy. However, how can I do this task if the website is not written in html and requires a browser to show the relevant information, e.g.

Scraping data from html tables from html websites is cool and easy. However, how can I do this task if the website is not written in html and requires a browser to show the relevant information, e.g. if it's an asp website or the data is not in the code but comes in through java code?

Like it is here: http://www.bwea.com/ukwed/construction.asp.

With VBA for excel one can write a function that opens and IE session calling the website an开发者_高级运维d then basically copy and pasting the content of the website. Any chance to do something similar with R?


This is normal HTML, with the associated normal trouble of having to clean up after scraping the data.

The following does the trick:

  • Read the page with readHTMLTable in package XML
  • It's the fifth table on the page, so extract the fifth element
  • Take the first row and assign it to the names of the table
  • Delete the first row

The code:

x <- readHTMLTable("http://www.bwea.com/ukwed/construction.asp", 
                   as.data.frame=TRUE, stringsAsFactors=FALSE)
dat <- x[[5]]
names(dat) <- unname(unlist(dat[1, ]))

The resulting data:

dat <- dat[-1, ]

'data.frame':   39 obs. of  10 variables:
 $ Date                : chr  "September 2011" "August 2011" "August 2011" "August 2011" ...
 $ Wind farm           : chr  "Baillie Wind farm - Bardnaheigh Farm" "Mains of Hatton" "Coultas Farm" "White Mill (Coldham ext)" ...
 $ Location            : chr  "Highland" "Aberdeenshire" "Nottinghamshire" "Cambridgeshire" ...
 $ Power(MW)           : chr  "2.5" "0.8" "0.33" "2" ...
 $ Turbines            : chr  "21" "3" "1" "7" ...
 $ MW Capacity         : chr  "52.5" "2.4" "0.33" "14" ...
 $ Annual homes equiv*.: chr  "29355" "1342" "185" "7828" ...
 $ Developer           : chr  "Baillie" "Eco2" "" "COOP" ...
 $ Latitude            : chr  "58 02 52N" "57 28 11N" "53 04 33N" "52 35 47N" ...
 $ Longitude           : chr  "04 07 40W" "02 30 32W" "01 18 16W" "00 07 41E" ...


That site just delivers HTML, as Thomas comments. Some sites use JavaScript to get values via an AJAX call and insert them into the document dynamically - those won't work via a simple scraping. The trick with those is to use a JavaScript debugger to see what the AJAX calls are and reverse engineer them from the Request and Response.

The hardest thing will be sites driven by Java Applets, but thankfully these are rare. These could be getting their data via just about any network mechanism, and you'd have to reverse engineer all that by inspecting network traffic.

Even IE/VBA will fail if its a Java applet, I reckon.

Also, don't confuse java and JavaScript.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号