I am trying to scrape data from a website that uses JavaScript to load much of its content. Right now I am using jSoup to parse the HTML pages, but since much of the content is loaded with JavaScript, I haven't been able to parse the data I want.
How should I go about getting this JavaScript-generated content? Should I first save the page and then load and parse it with jSoup? If so, what should I use to render the JavaScript content before saving? Is there an API you would recommend that can output the final HTML?
Currently using Java.
You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool that uses PhantomJS to let you scrape with JavaScript and jQuery in a full browser context. Among other things, you can define a "ready" function for the page and wait to scrape until that function (which might, for example, check for the existence of certain DOM elements) returns true.
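pjscrape itself is driven by JavaScript config files, so if you would rather stay in Java, a headless browser such as HtmlUnit offers a comparable "load the page, let the JavaScript finish, then scrape" workflow. A minimal sketch, assuming a reasonably recent HtmlUnit on the classpath; the URL is a placeholder and the 10-second wait is an arbitrary choice:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            // Load the page, then give its background JavaScript time to finish.
            HtmlPage page = client.getPage("http://example.com/js-heavy-page");
            client.waitForBackgroundJavaScript(10000); // wait up to 10 seconds
            // The DOM now contains the JavaScript-generated content.
            System.out.println(page.asXml());
        }
    }
}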
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
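For instance, if Firebug's Net panel shows the page filling itself from a JSON endpoint, you can request that URL directly from Java and never touch the rendered HTML. A minimal sketch; the endpoint URL is a made-up placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AjaxFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint discovered by watching the page's AJAX calls.
        URL url = new URL("http://example.com/ajax/data.json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line).append("\n");
            }
        }
        System.out.println(json); // raw JSON, ready for a JSON parser
    }
}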
Sometimes the data is already present in the downloaded page even though it is only rendered by JavaScript, e.g. embedded as inline JSON inside a script tag. In that case the better approach is to parse it on the fly, just as you would with plain HTML or text parsing. If you cannot isolate the tokens with the jSoup API, fall back to direct String operations and treat the page as plain text.
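As a minimal sketch of that fallback: suppose the page defines its data in an inline variable; the variable name (data) and the regex below are assumptions for illustration only:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InlineDataScrape {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/js-heavy-page").get();
        // Hypothetical pattern: the page embeds its data as "var data = {...};"
        Pattern p = Pattern.compile("var data = (\\{.*?\\});", Pattern.DOTALL);
        for (Element script : doc.select("script")) {
            Matcher m = p.matcher(script.data()); // raw script text, not DOM
            if (m.find()) {
                System.out.println(m.group(1));   // the embedded JSON string
            }
        }
    }
}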
I tried using HtmlUnit, but I found it very slow.
I ended up shelling out to the curl command-line tool from Java, which worked for my purposes:
// Requires java.io.BufferedReader and java.io.InputStreamReader.
static String fetchHtml(String url) throws Exception {
    // Run curl and capture its standard output as the page HTML.
    Process p = Runtime.getRuntime().exec(new String[] {"curl", url});
    StringBuilder html = new StringBuilder();
    try (BufferedReader stdInput = new BufferedReader(
            new InputStreamReader(p.getInputStream()))) {
        String s;
        while ((s = stdInput.readLine()) != null) {
            html.append(s).append("\n");
        }
    }
    return html.toString();
}
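The returned string can then be handed to jSoup as before, e.g. Document doc = Jsoup.parse(fetchHtml(url));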