Run bash commands in parallel, track results and count

I was wondering how, if possible, I can create a simple job manager in Bash to process several commands in parallel. That is, I have a big list of commands to run, and I'd like to have two of them running at any given time.

I know quite a bit about bash, so here are the requirements that make it tricky:

The commands have variable running times, so I can't just spawn two, wait, and then continue with the next two. As soon as one command is done, the next command must be started.
The controlling process needs to know the exit code of each command so that it can keep a total of how many failed.

I'm thinking somehow I can use trap but I don't see an easy way to get the exit value of a child inside the handler.

So, any ideas on how this can be done?

Well, here is some proof of concept code that should probably work, but it breaks bash: invalid command lines generated, hanging, and sometimes a core dump.

# need monitor mode for trap CHLD to work
set -m
# store the PIDs of the children being watched
declare -a child_pids

function child_done
{
    echo "Child $1 result = $2"
}

function check_pid
{
    # check if running
    kill -s 0 $1
    if [ $? == 0 ]; then
        child_pids=("${child_pids[@]}" "$1")
    else
        wait $1
        ret=$?
        child_done $1 $ret
    fi
}

# check by copying pids, clearing list and then checking each, check_pid
# will add back to the list if it is still running
function check_done
{
    to_check=("${child_pids[@]}")
    child_pids=()

    for ((i=0;$i<${#to_check};i++)); do
        check_pid ${to_check[$i]}
    done
}

function run_command
{
    "$@" &
    pid=$!
    # check this pid now (this will add to the child_pids list if still running)
    check_pid $pid
}

# run check on all pids anytime some child exits
trap 'check_done' CHLD

# test
for ((tl=0;tl<10;tl++)); do
    run_command bash -c "echo FAIL; sleep 1; exit 1;"
    run_command bash -c "echo OKAY;"
done

# wait for all children to be done
wait

Note that this isn't what I ultimately want, but would be groundwork to getting what I want.

Followup: I've implemented a system to do this in Python, so anybody using Python for scripting can have the above functionality. Refer to shelljob.

GNU Parallel is awesomesauce:

$ parallel -j2 < commands.txt
$ echo $?

It will set the exit status to the number of commands that failed. If you have more than 253 commands, check out --joblog. If you don't know all the commands up front, check out --bg.

Can I persuade you to use make? This has the advantage that you can tell it how many commands to run in parallel (modify the -j number):

echo -e ".PHONY: c1 c2 c3 c4\nall: c1 c2 c3 c4\nc1:\n\tsleep 2; echo c1\nc2:\n\tsleep 2; echo c2\nc3:\n\tsleep 2; echo c3\nc4:\n\tsleep 2; echo c4" | make -f - -j2

Stick it in a Makefile and it will be much more readable:

.PHONY: c1 c2 c3 c4
all: c1 c2 c3 c4
c1:
        sleep 2; echo c1
c2:
        sleep 2; echo c2
c3:
        sleep 2; echo c3
c4:
        sleep 2; echo c4

Beware: the indentation at the start of each command line must be a single TAB, not spaces, so a straight cut and paste from here won't work.

Put an "@" in front of each command if you don't want the command echoed, e.g.:

        @sleep 2; echo c1

This would stop on the first command that failed. If you need a count of the failures, you'd need to engineer that into the makefile somehow. Perhaps something like:

command || echo F >> failed

Then count the lines in the file named failed.
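
The append-on-failure idea is not tied to make. As a minimal sketch in plain bash (the temporary file and the three sample commands are illustrative assumptions, not part of the answer above), each failing command adds one line to a file, and the line count afterwards is the failure count:

```shell
#!/usr/bin/env bash
# Sketch: count failures by appending one marker line per failed command.
# The file name and the sample commands are illustrative assumptions.
failed_file=$(mktemp)

# The trailing & puts the whole "cmd || echo" list in the background.
false    || echo F >> "$failed_file" &
true     || echo F >> "$failed_file" &
(exit 3) || echo F >> "$failed_file" &
wait   # block until every background list has finished

# Reading from stdin makes wc print only the number, without a file name.
fail_count=$(wc -l < "$failed_file")
fail_count=$((fail_count))   # strip padding that some wc versions add
echo "failed: $fail_count"
rm -f "$failed_file"
```

Because the marker is written by the same list that ran the command, nothing has to inspect exit codes later. The price is that only the count survives, not which command failed or with what status.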

The problem you have is that plain wait cannot wait for just one of several background processes to complete (bash 4.3 and later add wait -n for that). If you observe job status (using jobs), finished background jobs are removed from the job list, so you need another mechanism to determine whether a background job has finished.

The following example starts two background processes (sleeps). It then loops, using ps to see whether they are still running. When one is not, it uses wait to collect the exit code and starts a new background process.

#!/bin/bash

sleep 3 &
pid1=$!
sleep 6 &
pid2=$!

while true; do
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi

    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
done

Edit: Using SIGCHLD (without polling):

#!/bin/bash

set -bm
trap 'ChildFinished' SIGCHLD

function ChildFinished() {
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi

    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
}

sleep 3 &
pid1=$!
sleep 6 &
pid2=$!

sleep 1000d
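
On bash 4.3 or newer, wait -n blocks until any one background job exits and returns that job's status, which covers both requirements in the question without ps or a CHLD trap. The following is only a sketch under that version assumption; run_limited and the sample command strings are hypothetical names chosen for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: needs bash 4.3 or newer for `wait -n`.
# run_limited and the sample command strings are hypothetical.

run_limited() {
    local max_jobs=$1; shift
    local running=0 cmd
    failed=0                      # global: number of non-zero exits

    for cmd in "$@"; do
        if (( running >= max_jobs )); then
            # Block until any one job exits; wait returns that job's status.
            wait -n || failed=$((failed + 1))
            running=$((running - 1))
        fi
        bash -c "$cmd" &
        running=$((running + 1))
    done

    # Drain the jobs that are still outstanding.
    while (( running > 0 )); do
        wait -n || failed=$((failed + 1))
        running=$((running - 1))
    done
}

run_limited 2 "sleep 0.3; exit 1" "true" "sleep 0.1; exit 7" "sleep 0.2" "false"
echo "failed: $failed"
```

A new command starts as soon as any slot frees up, so a slow job never holds back the rest of the list. Used this way, wait -n does not say which job finished, so this version keeps only a failure total.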

I think the following example answers part of your question; I am looking into the rest of it:

(cat list1 list2 list3 | sort | uniq > list123) &
(cat list4 list5 list6 | sort | uniq > list456) &

from:

Running parallel processes in subshells

There is another package for Debian systems, named xjobs.

You might want to check it out:

http://packages.debian.org/wheezy/xjobs

If you cannot install parallel for some reason, this will work in bash (note that the [[ ... ]] test it uses is not available in a strictly POSIX sh):

# String to detect failure in subprocess
FAIL_STR=failed_cmd

result=$(
    (false || echo ${FAIL_STR}1) &
    (true  || echo ${FAIL_STR}2) &
    (false || echo ${FAIL_STR}3)
)
wait

if [[ ${result} == *"$FAIL_STR"* ]]; then
    failure=`echo ${result} | grep -E -o "$FAIL_STR[^[:space:]]+"`
    echo The following commands failed:
    echo "${failure}"
    echo See above output of these commands for details.
    exit 1
fi

Here true and false are placeholders for your commands. You can also echo $? along with the FAIL_STR to get the command status.
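
Echoing $? next to the marker could look like the sketch below. The run helper and the "label:status" layout are assumptions made for illustration; only FAIL_STR comes from the answer above:

```shell
#!/usr/bin/env bash
# Sketch: report each failing command's exit status next to the marker.
# The run helper and the label:status layout are illustrative assumptions.
FAIL_STR=failed_cmd

run() {
    local label=$1; shift
    "$@"
    local rc=$?
    # Print "<marker><label>:<status>" only when the command failed.
    (( rc == 0 )) || echo "${FAIL_STR}${label}:${rc}"
}

result=$(
    run 1 false &
    run 2 true &
    run 3 bash -c 'exit 4' &
    wait   # runs inside the substitution, so it waits for the three jobs
)

echo "$result"
fail_count=$(grep -c "$FAIL_STR" <<< "$result")
echo "failures: $fail_count"
```

The order of the marker lines depends on which job finishes first, so the label is what ties a status back to its command.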

Yet another bash-only example, for your interest. Of course, prefer GNU parallel, which offers many more features out of the box.

This solution writes each job's output to a temporary file and uses those files to collect job status.

We use /tmp/${$}_ as the temporary file prefix; $$ is the process ID of the parent shell, and it stays the same for the whole script execution.

First, the loop that starts the parallel jobs in batches. The batch size is set with max_parrallel_connection, and each batch must finish completely (wait) before the next one starts. try_connect_DB() is a slow bash function in the same file. We collect stdout and stderr together (2>&1) for failure diagnostics.

nb_project=$(echo "$projects" | wc -w)
i=0
parrallel_connection=0
max_parrallel_connection=10
for p in $projects
do
  i=$((i+1))
  parrallel_connection=$((parrallel_connection+1))
  try_connect_DB $p "$USERNAME" "$pass" > /tmp/${$}_${p}.out 2>&1 &

  if [[ $parrallel_connection -ge $max_parrallel_connection ]]
  then
    echo -n " ... ($i/$nb_project)"
    wait
    parrallel_connection=0
  fi
done
if [[ $nb_project -gt $max_parrallel_connection ]]
then
  # final new line
  echo
fi

# wait for all remaining jobs
wait

Once all jobs have finished, review the results.

SQL_connection_failed is our error convention, output by try_connect_DB(); you may detect job success or failure in whatever way best suits your needs.

Here we decided to output only the failed results, to reduce the amount of output for large job sets, especially when most or all of them pass.

# displaying result that failed
file_with_failure=$(grep -l SQL_connection_failed /tmp/${$}_*.out)
if [[ -n $file_with_failure ]]
then
  nb_failed=$(wc -l <<< "$file_with_failure")
  # we will collect DB name from our output file naming convention, for post treatment
  db_names=""
  echo "=========== failed connections : $nb_failed/$nb_project"
  for failure in $file_with_failure
  do
    echo "============ $failure"
    cat $failure
    db_names+=" $(basename $failure | sed -e 's/^[0-9]\+_\([^.]\+\)\.out/\1/')"
  done
  echo "$db_names"
  ret=1
else
  echo "all tests passed"
  ret=0
fi

# temporary files cleanup; they could be kept in case of error, adapt to suit your needs.
rm /tmp/${$}_*.out
exit $ret
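
As a side note, the basename and sed pipeline that recovers the DB name from the file name can also be written with bash parameter expansion, which avoids starting two processes per failed file. A small sketch, using a made-up path that follows the same <pid>_<name>.out convention:

```shell
#!/usr/bin/env bash
# Sketch: recover the name from "<pid>_<name>.out" without basename or sed.
# The path below is a made-up example.
failure=/tmp/12345_orders_db.out

name=${failure##*/}    # drop the directory part  -> 12345_orders_db.out
name=${name#*_}        # drop the "<pid>_" prefix -> orders_db.out
name=${name%.out}      # drop the ".out" suffix   -> orders_db
echo "$name"
```

This prints orders_db. Unlike the sed pattern, which stops at the first dot, this version also keeps names that contain dots.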