I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.
I'd like to just screen scrape it and开发者_JAVA百科 use js (coffeescript) to automate that. I wonder if this is possible. I could do this with C# and linqpad but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus if I do it with js or coffeescript I'll get much more comfortable with those languages and I'll be able to use jQuery for pulling elements out of the DOM.
I see two possibilities here:
- use C# and find a library that will do things like Jquery but in C# code
- use coffeescript (js) and use jquery to find the elements that I'm looking for in the page
I'd also like to automate the page a bit (get next set of results). This is strictly for personal use -- I'm not pulling results of someone's search to use in my business. I just want to make a crappy search engine do what I want.
I wrote a class that allows you to supply a bunch of urls and a code block to scrape pages inside a chrome extension. You can find the github repo here: https://github.com/jkarmel/Executor. It could use some more testing and I need to work on the documentation, but it looks like it might be what you are looking for.
Here is how you would use it to get the all the links from a few different pages:
/*
* background.js by Jeremy Karmel.
*/
URLS = ['http://www.apple.com/',
'http://www.google.com/',
'http://www.facebook.com/',
'http://www.stanford.edu'];
//Function will be provided to exector to collect information
var getLinks = function() {
var links = [];
var numLinks = $('a');
$links.each(function(i, val) {links.push(val.href)});
var request = {data: links, url: window.location.href};
chrome.extension.sendRequest(request);
}
var main = function() {
var specForUsersTopics = {
urls : URLS,
code : getLinks,
callback : function(results) {
for (var url in results) {
console.log(url + ' has ' + results[url].length + ' links.');
var links = results[url];
for (var i = 0; i < links.length; i++)
console.log(' ' + links[i]);
}
console.log('all done!!!!');
}
};
var exec = Executor(specForUsersTopics);
exec.start();
}
main();
So basically the code to collect the links would be supplied to the executor instance and then you would do whatever you wanted with the results in the callback. It can deal with longish lists of url (~1000) and it will work on more than one at a time (default == 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.
I'm liking Curtain A) "use C# and find a library..."
"HTML Agility Pack" might be just what you're looking for:
http://htmlagilitypack.codeplex.com/
You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).
精彩评论