I purchased a PHP script that scrapes information from an HTML page (using regex on the HTML source). It works fine when the page is plain HTML. However, some pieces of information are populated by Ajax/JavaScript, and the scraper cannot get them (only blanks are returned).
This is an example of the HTML source I need to scrape. The {d10}{d1} etc. is a timestamp, and it is not filled in when I grab the source:
layout: '<p><span>Time Remaining</span><br><strong>{d10}{d1} : {h10}{h1} : {m10}{m1} : {s10}{s1}</strong><br><span>Days Hours Mins Sec</span>
The function being called to get the HTML source is:
getContents($URL)
Is there another way to get the HTML source from a URL so that all the AJAX values are already rendered? I read about cURL; would that give me the HTML source with the values already populated by AJAX?
Thanks
You would need a scraper that can render JavaScript for that; I'm not sure if there are any, though. I'm sure spam would be on a whole new level if bots could scrape JS.
Technically it is doable. You will have to parse the JS code to find the URL that the XMLHttpRequest data is requested from. Then you can call this URL using cURL from PHP and parse the data. The challenge is to understand how the onload events are implemented in JS and which DOM nodes they act on.
If you can pin down the structure of the Ajax URL (assuming there is one), then you can probably request the data yourself, picking the URL parameters from the respective DOM elements.
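A minimal sketch of that approach, under some loud assumptions: the sample page source, the `/auction/timer.php?id=123` endpoint, and the helper names `findAjaxUrls` and `fetchUrl` are all hypothetical, and the regex only covers the simple case where the URL is a quoted string after a `url:` key. Note that cURL on its own only returns what the server sends; it does not run the page's JavaScript.

```php
<?php
// Hypothetical sample of the page source; the real script and endpoint will differ.
$pageSource = <<<'HTML'
<script>
$.ajax({ url: "/auction/timer.php?id=123", success: render });
</script>
HTML;

/**
 * Pull candidate Ajax URLs out of inline JavaScript.
 * Assumes each URL appears as a quoted string after a `url:` key.
 */
function findAjaxUrls(string $source): array
{
    preg_match_all('/\burl\s*:\s*(["\'])(.*?)\1/', $source, $matches);
    return $matches[2];
}

/**
 * Fetch a URL with cURL and return the raw response body.
 * No JavaScript is executed; this is just the server's response.
 */
function fetchUrl(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Some endpoints only answer requests that look like XMLHttpRequest calls.
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Requested-With: XMLHttpRequest']);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}

$urls = findAjaxUrls($pageSource);
print_r($urls);

// With a real site, the relative URL would be joined to the page's host, e.g.:
// $data = fetchUrl('http://example.com' . $urls[0]);
```

The response from such an endpoint is often JSON or a small HTML fragment, so it may be easier to parse than the full page. If the value is computed entirely in the browser (for example a countdown driven by a target time embedded in the script), there may be no separate URL to call, and the target value would have to be read out of the script source instead.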