I have access to a web interface for a large amount of data. This data is usually accessed by people who only want a handful of items. The company that I work for wants me to download the whole set. Unfortunately, the interface only allows you to see fifty elements (of tens of thousands) at a time, and segregates the data into different folders.
To make matters worse, all of the data lives at the same URL; the page updates itself dynamically through AJAX calls to an .aspx interface. That, combined with the authentication required, makes a simple curl script difficult to write.
How can I write a script that navigates around a page, triggers AJAX requests, waits for the page to update, and then scrapes the data? Has this problem been solved before? Can anyone point me towards a toolkit?
Any language is fine, I have a good working knowledge of most web and scripting languages.
Thanks!
I usually use a program like Fiddler or Live HTTP Headers and watch what's happening behind the scenes. 99.9% of the time you'll find a query string or REST call with a very simple pattern that you can emulate.
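For example, once the tool shows you the actual call behind the page, you can usually replay it directly and page through the data fifty items at a time. Here's a minimal sketch using Python's requests library; the endpoint, parameter names, page size, and login details are all placeholders for whatever the captured traffic actually shows:

```python
import requests

# Placeholders: the real endpoint, parameters, and auth details come from
# whatever Fiddler / Live HTTP Headers reveals.
BASE_URL = "https://example.com/DataHandler.aspx"
PAGE_SIZE = 50

session = requests.Session()
# Log in once so the session cookie is reused on every subsequent request
session.post("https://example.com/Login.aspx",
             data={"username": "me", "password": "secret"})

rows = []
page = 0
while True:
    resp = session.get(BASE_URL, params={
        "folder": "some-folder",     # hypothetical folder parameter
        "offset": page * PAGE_SIZE,
        "limit": PAGE_SIZE,
    })
    resp.raise_for_status()
    batch = resp.json()              # or resp.text if the handler returns HTML
    if not batch:
        break                        # no more items in this folder
    rows.extend(batch)
    page += 1

print(f"Fetched {len(rows)} items")
```

Once you know the pattern, looping over folders and offsets like this is usually far simpler than driving the UI itself.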
If you need to directly control a browser
Have you thought of using a tool like WatiN? It's actually meant for UI testing, but I suppose you could use it to programmatically drive the page and act upon the responses.
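WatiN is a .NET library, so a C# example is the natural fit there; since you said any language is fine, here is a rough sketch of the same browser-driving idea using Selenium in Python instead. The URL and element IDs are invented and would need to match the real page:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Hypothetical URL and element IDs -- replace with whatever the real page uses.
driver = webdriver.Firefox()
driver.get("https://example.com/data.aspx")

# Click the control that fires the AJAX request for the next 50 items
driver.find_element(By.ID, "btnNextPage").click()

# Wait until the results table has been re-rendered by the AJAX callback
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#results tr"))
)

# Scrape the now-updated rows
for row in driver.find_elements(By.CSS_SELECTOR, "#results tr"):
    print([cell.text for cell in row.find_elements(By.TAG_NAME, "td")])

driver.quit()
```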
If you just need to get the data
But since you can use whatever you please, you can simply make ordinary web requests from a desktop application and parse the results, customizing it to your own needs and simulating AJAX requests at will by setting the appropriate request headers.
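Because the backend is an .aspx page, those AJAX calls are quite possibly ASP.NET partial postbacks, which means each simulated request has to echo back the hidden state fields and mark itself as an async postback. A minimal sketch, assuming a WebForms UpdatePanel; every URL, control name, and parameter value here is a placeholder you'd take from the captured traffic:

```python
import requests
from bs4 import BeautifulSoup

# Assumes a classic ASP.NET WebForms page whose AJAX calls are partial
# postbacks; the URL, control names, and ScriptManager ID are guesses.
URL = "https://example.com/data.aspx"

session = requests.Session()
page = session.get(URL)
soup = BeautifulSoup(page.text, "html.parser")

# WebForms requires the hidden state fields (__VIEWSTATE etc.) to be echoed
# back on every post
state = {
    f["name"]: f.get("value", "")
    for f in soup.select("input[type=hidden]")
    if f.get("name")
}

resp = session.post(URL, data={
    **state,
    "__EVENTTARGET": "gridResults",              # hypothetical paging control
    "__EVENTARGUMENT": "Page$2",
    "ScriptManager1": "pnlResults|gridResults",  # async postback marker
}, headers={
    "X-Requested-With": "XMLHttpRequest",  # identify the request as AJAX
    "X-MicrosoftAjax": "Delta=true",       # ask for the partial-render response
})

print(resp.text[:500])  # delta response containing the updated panel HTML
```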
Maybe this?
Website scraping using jQuery and AJAX:
http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/