
Good library/platform for a real-time/parallel HTTP crawler? [closed]

https://www.devze.com 2023-01-13 07:37 (source: web)
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 9 years ago.

I am building a crawler that fetches information in parallel from a number of websites in real-time in response to a request for this information from a client. I need to request specific pages from 10-20 websites, parse their contents for specific snippets of information and return this information to the client as fast as possible. I want to do it asynchronously, so the client gets the first result displayed as soon as it is ready, while the other requests are still pending.
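The "parse for specific snippets" step can stay very small when the target markup is predictable. A minimal sketch using only the Ruby standard library (the page fragment and pattern here are illustrative; real-world pages usually warrant a proper HTML parser such as Nokogiri):

```ruby
# Illustrative page fragment; in the real crawler this would be the
# body of an HTTP response.
page = <<HTML
<html><body>
  <span class="price">$19.99</span>
</body></html>
HTML

# String#[] with a regexp and a capture-group index pulls out the snippet.
price = page[/<span class="price">([^<]+)<\/span>/, 1]  # => "$19.99"
puts price
```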

I have a Ruby background and would therefore prefer to build the solution in Ruby; however, parallelism and speed are exactly what Ruby is known NOT to excel at. I believe that libraries such as EventMachine and Typhoeus can remedy that, but I am also strongly considering node.js, because I know JavaScript quite well and it seems to be built for exactly this kind of thing.
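Whichever stack ends up doing the fetching, the core pattern is the same: fan the requests out concurrently and hand each result back the moment it completes. Here is a self-contained sketch of that pattern in plain Ruby threads, with a stubbed fetch in place of a real HTTP call (the sleep stands in for network latency; with EventMachine or Typhoeus you would queue real requests instead):

```ruby
# Simulated sites: host => pretend network latency in seconds.
SITES = { "slow.example" => 0.3, "medium.example" => 0.15, "fast.example" => 0.05 }

def crawl(sites)
  results = Queue.new                    # thread-safe channel back to the caller
  sites.each do |host, latency|
    Thread.new do
      sleep latency                      # stand-in for the HTTP round trip
      results << [host, "snippet from #{host}"]
    end
  end
  # Yield each result as soon as it is ready, in completion order,
  # not in the order the requests were issued.
  sites.size.times { yield results.pop }
end

arrival_order = []
crawl(SITES) { |host, _snippet| arrival_order << host }
puts arrival_order.inspect               # fastest site comes back first
```

This is exactly the behaviour the question asks for: the client-facing side can start streaming the first snippet while the slower requests are still pending.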

Whatever I choose, I also need an efficient way to communicate the results back to the client. I am considering plain AJAX (but that would require polling the server), WebSockets (but that would require a fallback for older browsers), and dedicated solutions for persistent client/server communication such as Cramp, Juggernaut, and Pusher.
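For the plain-AJAX option, the server side can be as small as a thread-safe buffer: crawler threads push finished snippets in, and each client poll drains whatever has arrived since the last call. A hedged sketch (the class and method names are illustrative, not from any of the libraries mentioned above):

```ruby
# Per-client-request result buffer. Crawler threads call #push as pages
# finish; the polling AJAX endpoint calls #drain on each request.
class ResultBuffer
  def initialize
    @mutex   = Mutex.new
    @pending = []
  end

  def push(snippet)                  # called from crawler threads
    @mutex.synchronize { @pending << snippet }
  end

  def drain                          # called by the polling endpoint
    @mutex.synchronize do
      drained = @pending.dup
      @pending.clear
      drained
    end
  end
end

buffer = ResultBuffer.new
buffer.push("snippet from site A")
buffer.push("snippet from site B")
first_poll  = buffer.drain           # both snippets
second_poll = buffer.drain           # empty until more results arrive
```

The push-based options (WebSockets, Juggernaut, Pusher) avoid the polling round trips entirely, but the buffering idea carries over: results still queue up server-side until the channel is ready to send them.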

Does anyone have any experience and/or recommendations they would like to share?


Node is definitely capable of handling this type of task: async socket and HTTP communication is baked in and really pleasant to work with.

Most of my work is j/Ruby, and I have found the transition to server-side JavaScript pretty painless; years of web development mean I know JS pretty well, and the server development concepts are largely the same regardless of language.

In terms of communication, Socket.io is an excellent client and server framework for handling socket communication in Node. It supports Flash, AJAX, and WebSocket transports, which means it can be used with just about any modern (and some older) browsers.


If your crawler needs JavaScript support, I recommend http://htmlunit.sourceforge.net/.
There is a JRuby wrapper available at http://celerity.rubyforge.org/.

Features (taken from the site) include:

  • Fast - No time-consuming GUI rendering or unessential downloads
  • Easy to use - Simple API
  • JavaScript support
  • Scalable - Java threads let you run tests in parallel
  • Portable - Cross-platform thanks to the JVM
  • Unintrusive - No browser window interrupting your workflow (runs in background)
