Extracting data from JavaScript webpages

Developer https://www.devze.com 2023-02-28 18:14 Source: web
I need to build a system to extract vast amounts of data from a collection of webpages. A lot of these sites (maybe 90% or so) are powered by various JavaScript systems. What is the most efficient method to extract this data?

Since every site is different I am looking for a flexible solution, and since there are many sites I am looking for a solution that'll put as little stress on my network as possible.

Most of my programming experience is in C, C++ and Perl, but I'm happy to use whatever gives the best result.

The webpages have constantly updating numbers and statistics that I wish to extract and perform some analysis on, so I need to be able to easily store them in a database.

I've done some research of my own, but I'm really coming up blank here. I'm hoping someone else can help me! :)


You will need a browser that interprets the JavaScript, and does the actual requests for you. You will then need to take a DOM snapshot of the interpreted result. It's not going to be trivial, and it's going to be impossible in pure PHP.

I have no experience with it myself, but maybe the Selenium suite can help. It's an automation suite used for software testing, but according to this article, it can to some extent also be used for scraping.
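
A minimal sketch of that approach in Python, assuming the Selenium Python bindings and a local Chrome install are available (neither is mentioned in the question). The names render_page, TagTextCollector and extract_text are made up for illustration. The idea is only this: let a real browser run the page's scripts, take the resulting DOM as HTML, then parse that snapshot like any static document.

```python
import time
from html.parser import HTMLParser


def render_page(url, wait_seconds=5):
    """Load url in headless Chrome and return a snapshot of the DOM as HTML.

    Assumption: the `selenium` package and Chrome are installed.
    """
    # Imported lazily so the parsing code below also works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude fixed wait so the page's scripts can fill in their numbers.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()


class TagTextCollector(HTMLParser):
    """Collect the text content of every element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0  # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.texts.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def extract_text(html, tag):
    """Return the whitespace-normalised text of each outermost <tag> element."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [" ".join(text.split()) for text in collector.texts]
```

Something like extract_text(render_page(url), "td") would then give the table cells as they appear after the scripts have run. Starting a browser per page is slow, so for many sites it is probably worth reusing one driver across several URLs.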


You could try PHP's DOMDocument class. For example, this code will "steal" the contents of all the table tags from the URL.

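// Fetch the raw HTML. The URL needs a scheme, or PHP looks for a local file.
$data = array();
$url = 'http://your.site.com';
$out = file_get_contents($url);

// Real-world markup is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($out);

// Collect the text content of every <table> element.
foreach ($dom->getElementsByTagName('table') as $table) {
    $data[] = $table->nodeValue;
}
print_r($data);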

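From there you can walk and manipulate the whole DOM and parse the entire HTML document. Note that DOMDocument only parses the HTML the server returns; it does not execute JavaScript. Consider calling this script asynchronously with an AJAX approach.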
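
The question also asks about getting the numbers into a database. Here is a rough sketch of that step in Python with the standard sqlite3 module, assuming the scraped values have already been reduced to (label, text) pairs. The table name readings, its columns, and the function names are all made up for illustration; any other database would work along the same lines.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " source TEXT NOT NULL,"
        " label TEXT NOT NULL,"
        " value REAL NOT NULL,"
        " fetched_at REAL NOT NULL)"
    )
    return conn


def parse_number(text):
    """Turn scraped text such as '1,234.5' into a float, or None if it is not numeric."""
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def store_readings(conn, source, pairs, fetched_at=None):
    """Insert (label, text) pairs for one source; return how many rows were stored."""
    if fetched_at is None:
        fetched_at = time.time()
    rows = []
    for label, text in pairs:
        value = parse_number(text)
        if value is not None:  # skip cells that do not hold a number
            rows.append((source, label, value, fetched_at))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

Storing a timestamp with every value keeps the history of the constantly updating numbers, which is what the later analysis would need.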
