How can I implement MapReduce using shell commands?

Developer https://www.devze.com 2022-12-27 17:17 Source: Internet

How do you execute a Unix shell command (e.g. an awk one-liner) on a cluster in parallel (step 1) and collect the results back on a central node (step 2)?

Update: I've just found http://blog.last.fm/2009/04/06/mapreduce-bash-script and it seems to do exactly what I need.

If all you're trying to do is fire off a bunch of remote commands, you could just use Perl. You can open an ssh command as a pipe and read its output back into Perl. (You will of course need to set up keys to allow password-less access.)

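use strict;
use warnings;

# Run myScript on hostB over ssh and read its output through a pipe.
open(my $remote, '-|', 'ssh', 'user@hostB', 'myScript')
    or die "Cannot start ssh: $!";
while (my $line = <$remote>) {
    print $line;
}
close($remote);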

You'd want to loop over your machine names and open one such pipe per host. After that, do non-blocking reads on the filehandles (for example with select) to pull back the data as it becomes available.
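The same fan-out idea can also be sketched in plain shell. This is not the non-blocking Perl approach described above but a simpler substitute: one background ssh job per host, each writing to its own temporary file, followed by a wait and a collect step. The function name fan_out and the REMOTE_RUNNER variable are hypothetical; REMOTE_RUNNER defaults to ssh and exists only so the sketch can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: run one command on every host in parallel, then gather the output.
# fan_out and REMOTE_RUNNER are illustrative names, not part of any tool.

fan_out() {
    local cmd="$1"; shift
    local runner="${REMOTE_RUNNER:-ssh}"
    local tmpdir host
    tmpdir=$(mktemp -d) || return 1

    # Step 1: start one background job per host, each with its own output file.
    # stdin is redirected so a background ssh cannot swallow the caller's input.
    for host in "$@"; do
        $runner "$host" "$cmd" < /dev/null \
            > "$tmpdir/$host.out" 2> "$tmpdir/$host.err" &
    done

    # Block until every background job has finished.
    wait

    # Step 2: emit the collected results in the order the hosts were given.
    for host in "$@"; do
        cat "$tmpdir/$host.out"
    done
    rm -rf "$tmpdir"
}
```

With password-less ssh in place, something like fan_out "uptime" host1 host2 hostn would print each host's output in turn. Unlike non-blocking reads, nothing is printed until the slowest host has finished.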

GNU parallel can be installed on your central node and used to run a command across multiple machines.

In the example below, multiple ssh connections are used to run a command on the remote hosts (-j sets the number of jobs to run at the same time on the central node). The combined output can then be piped to commands that perform the "reduce" stage (sort, then uniq -c, in this example).

parallel -j 50 ssh {} "ls" ::: host1 host2 hostn | sort | uniq -c

This example assumes passwordless (key-based) ssh login has been set up between the central node and all machines in the cluster.
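To make the map and reduce stages more concrete, here is a minimal word-count sketch. The function names map_word_counts and reduce_counts are hypothetical: the first is the kind of pipeline each remote host would run over its own data, and the second sums the "word count" lines that arrive from all hosts on the central node.

```shell
#!/usr/bin/env bash
# Sketch of a word-count map/reduce pair built from standard tools.
# map_word_counts and reduce_counts are illustrative names.

# Map: read text on stdin and print one "word count" line per distinct word.
map_word_counts() {
    tr -s '[:space:]' '\n' | sort | uniq -c | awk 'NF == 2 { print $2, $1 }'
}

# Reduce: read "word count" lines from any number of hosts on stdin,
# sum the counts per word, and print the totals sorted by word.
reduce_counts() {
    awk '{ total[$1] += $2 } END { for (w in total) print w, total[w] }' | sort
}
```

On the central node, the reduce step would sit at the end of the parallel pipeline in place of sort | uniq -c, with each host running the map pipeline over its local data.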

It can be tricky to escape characters correctly when running commands more complex than "ls" remotely; sometimes you even have to escape the escape character. You mention bashreduce, which may simplify this.
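One way to sidestep most of the escaping trouble is to let a small helper do the quoting. The sketch below wraps every argument in single quotes and rewrites each embedded single quote as '\'' so the remote shell receives the arguments literally. shell_quote and remote_command are hypothetical helper names, and this is only one possible approach.

```shell
#!/usr/bin/env bash
# Sketch: build a safely quoted command line to hand to ssh.
# shell_quote and remote_command are illustrative names.

# Wrap one argument in single quotes; each embedded ' becomes '\''.
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Join a program name and its arguments into one quoted command line.
remote_command() {
    local out="" arg
    for arg in "$@"; do
        out="$out$(shell_quote "$arg") "
    done
    printf '%s\n' "${out% }"
}
```

For example, ssh host1 "$(remote_command awk '{ print $1 }' access.log)" would run the awk one-liner on host1 with its program text intact (access.log is a made-up file name). Because the helper quotes everything, it suits a single command with arguments; pipes or redirections meant for the remote shell would still need to be written by hand.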