I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every turn. There must be an easier way to do this. Most notably, two things are irritating:
Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.
Redirect following. I want each request to follow through redirects when a 302 status code is returned.
I came across two things which looked useful, but which I couldn't use in the end:
http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.
http://www.phantomjs.org/, but I couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.
Are there any JavaScript screenscraping-esque libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?
I actually have a scraper library now: https://github.com/mikeal/spider. It's quite nice; you can use jQuery and routes.
Feedback is welcome :)
You may want to check out https://github.com/mikeal/request from mikeal. I just spoke to him in the chatroom and he says that it does not handle cookies at the moment, but you can write a submodule to handle them yourself in the meantime.
As for redirects, it handles them beautifully :)
It turns out someone made a phantomjs module for node.js:
https://github.com/sgentle/phantomjs-node
While phantom is fairly heavy, it also supports SSL, cookies, and everything else a typical browser supports (since it is a WebKit browser, after all).
Give it a shot, it may be exactly what you are looking for.