
How to get the PID of a process in a pipeline

https://www.devze.com · 2023-01-09 15:12 (source: web)

Consider the following simplified example:


my_prog | awk '...' > output.csv &
my_pid="$!" #Gives the PID for awk instead of for my_prog
sleep 10
kill $my_pid #my_prog still has data in its buffer that awk never saw. Data is lost!

In bash, $my_pid holds the PID of awk. However, I need the PID of my_prog. If I kill awk, my_prog does not know to flush its output buffer and data is lost. So, how would one obtain the PID of my_prog? Note that ps aux|grep my_prog will not work, since there may be several instances of my_prog running.

NOTE: changed cat to awk '...' to help clarify what I need.


Just had the same issue. My solution:

process_1 | process_2 &
PID_OF_PROCESS_2=$!
PID_OF_PROCESS_1=`jobs -p`

Just make sure process_1 is the first background process. Otherwise, you need to parse the full output of jobs -l.
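A minimal sketch of this approach (assuming bash and no other background jobs in the current shell; sleep stands in for the two real processes):

```shell
# Start a two-stage pipeline in the background. $! holds the PID of the
# last stage; `jobs -p` prints the job's process-group leader, which for
# a pipeline is the first stage.
sleep 5 | sleep 5 &
pid_of_process_2=$!            # PID of the last pipeline element
pid_of_process_1=$(jobs -p)    # PID of the first pipeline element
echo "first: $pid_of_process_1  last: $pid_of_process_2"
kill "$pid_of_process_1" "$pid_of_process_2" 2>/dev/null
```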


I was able to solve it by explicitly naming the pipe using mkfifo.

Step 1: mkfifo capture.

Step 2: Run this script


my_prog > capture &
my_pid="$!" #Now, I have the PID for my_prog!
awk '...' capture > out.csv & 
sleep 10
kill $my_pid #kill my_prog
wait #wait for awk to finish.

I don't like the management of having a mkfifo. Hopefully someone has an easier solution.
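For completeness, the two steps above can be folded into one self-contained sketch (seq and cat stand in for my_prog and awk '...'; the FIFO lives in a throwaway directory):

```shell
# Named-pipe variant as a single script. The producer is started first,
# so $! really is its PID; the reader drains the FIFO into a file.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
mkfifo "$dir/capture"

seq 1 5 > "$dir/capture" &              # stand-in for my_prog
my_pid=$!                               # now $! is the producer's PID
cat "$dir/capture" > "$dir/out.csv" &   # stand-in for awk '...'
wait                                    # both sides finish on their own here
echo "producer PID was $my_pid"
```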


Here is a solution without wrappers or temporary files. This only works for a background pipeline whose output is captured away from stdout of the containing script, as in your case. Suppose you want to do:

cmd1 | cmd2 | cmd3 >pipe_out &
# do something with PID of cmd2

If only bash provided ${PIPEPID[n]}! The replacement "hack" that I found is the following:

PID=$( { cmd1 | { cmd2 0<&4 & echo $! >&3 ; } 4<&0 | cmd3 >pipe_out & } 3>&1 | head -1 )

If needed, you can also close fd 3 (for cmd*) and fd 4 (for cmd2) with 3>&- and 4<&-, respectively. If you do that, make sure cmd2 closes fd 4 only after redirecting its fd 0 from it.
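To make the redirections easier to follow, here is the same one-liner spelled out with placeholder commands (a sketch: sleep 5 plays the role of cmd2, whose PID we want):

```shell
# fd 3 carries cmd2's PID out of the background pipeline to head -1;
# fd 4 smuggles cmd2's real stdin (the pipe from cmd1) past the echo.
PID=$( { sleep 3 | { sleep 5 0<&4 & echo $! >&3 ; } 4<&0 | cat >/dev/null & } 3>&1 | head -1 )
echo "PID of the middle command: $PID"
kill -0 "$PID" && echo "it is still running"
```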


Add a shell wrapper around your command and capture the pid. For my example I use iostat.

#!/bin/sh
echo $$ > /tmp/my.pid
exec iostat 1

exec replaces the shell with the new process, preserving the PID.

test.sh | grep avg

While that runs:

$ cat my.pid 
22754
$ ps -ef | grep iostat
userid  22754  4058  0 12:33 pts/12   00:00:00 iostat 1

So you can:

sleep 10
kill `cat my.pid`

Is that more elegant?


Improving @Marvin's and @Nils Goroll's answers with a one-liner that extracts the PIDs for all commands in the pipe into a shell array variable:

# run some command
ls -l | rev | sort > /dev/null &

# collect pids
pids=(`jobs -l % | egrep -o '^(\[[0-9]+\]\+|    ) [ 0-9]{5} ' | sed -e 's/^[^ ]* \+//' -e 's! $!!'`)

# use them for something
echo pid of ls -l: ${pids[0]}
echo pid of rev: ${pids[1]}
echo pid of sort: ${pids[2]}
echo pid of first command e.g. ls -l: $pids
echo pid of last command e.g. sort: ${pids[-1]}

# wait for last command in pipe to finish
wait ${pids[-1]}

In my solution ${pids[-1]} contains the value normally available in $!. Please note the use of jobs -l %, which outputs just the "current" job (by default, the last one started).

Sample output:

pid of ls -l: 2725
pid of rev: 2726
pid of sort: 2727
pid of first command e.g. ls -l: 2725
pid of last command e.g. sort: 2727

UPDATE 2017-11-13: Improved the pids=... command that works better with complex (multi-line) commands.


Based on your comment, I still can't see why you'd prefer killing my_prog to having it complete in an orderly fashion. Ten seconds is a pretty arbitrary measurement on a multitasking system, where my_prog could generate 10k lines or 0 lines of output depending on system load.

If you want to limit the output of my_prog to something more determinate try

my_prog | head -1000 | awk '...'

without detaching from the shell. In the worst case, head will close its input and my_prog will get a SIGPIPE. In the best case, change my_prog so it gives you the amount of output you want.
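That worst case is easy to observe with yes, which would write forever until head closes the pipe:

```shell
# yes would run indefinitely; after head prints 3 lines it exits, the
# pipe closes, and yes is terminated by the resulting SIGPIPE.
out=$(yes hello | head -n 3)
echo "$out"
```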

added in response to comment:

In so far as you have control over my_prog give it an optional -s duration argument. Then somewhere in your main loop you can put the predicate:

if (duration_exceeded()) {
    exit(0);
}

where exit will in turn properly flush the output FILEs. If desperate and there is no place to put the predicate, this could be implemented using alarm(3), which I am intentionally not showing because it is bad.

The core of your trouble is that my_prog runs forever. Everything else here is a hack to get around that limitation.
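Sketching that idea in shell for illustration (a hypothetical my_prog that takes the duration as its first argument and exits normally, so its output is flushed rather than lost to a kill):

```shell
# Hypothetical my_prog: emit lines until the deadline passes, then
# return normally, which flushes stdout, instead of being killed.
my_prog() {
    local end=$(( $(date +%s) + ${1:-10} ))
    while (( $(date +%s) < end )); do
        echo "data line"
        sleep 0.2
    done
}
lines=$(my_prog 1 | wc -l | tr -d ' ')
echo "emitted $lines lines in about 1 second"
```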


With inspiration from @Demosthenex's answer: using subshells:

$ ( echo $BASHPID > pid1; exec vmstat 1 5 ) | tail -1 & 
[1] 17371
$ cat pid1
17370
$ pgrep -fl vmstat
17370 vmstat 1 5
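The same trick as a self-contained sketch (sleep standing in for vmstat):

```shell
# The subshell first writes its own PID ($BASHPID, bash-specific) to a
# file, then exec replaces it with the real command, which keeps that PID.
pidfile=$(mktemp)
( echo $BASHPID > "$pidfile"; exec sleep 3 ) | cat > /dev/null &
right_pid=$!                   # PID of cat, the last pipeline element
sleep 0.5                      # give the subshell time to write the file
left_pid=$(cat "$pidfile")
echo "left-hand PID: $left_pid, right-hand PID: $right_pid"
rm -f "$pidfile"
```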


I was desperately looking for a good solution to get all the PIDs from a pipeline job, and one promising approach failed miserably (see previous revisions of this answer).

So, unfortunately, the best I could come up with is parsing the jobs -l output using GNU awk:

function last_job_pids {
    if [[ -z "${1}" ]] ; then
        return
    fi

    jobs -l "${1}" | awk '
        /^\[/ { delete pids; pids[$2]=$2; seen=1; next; }
        // { if (seen) { pids[$1]=$1; } }
        END { for (p in pids) print p; }'
}


My solution was to query jobs and parse it using perl.
Start two pipelines in the background:

$ sleep 600 | sleep 600 |sleep 600 |sleep 600 |sleep 600 &
$ sleep 600 | sleep 600 |sleep 600 |sleep 600 |sleep 600 &

Query background jobs:

$ jobs
[1]-  Running                 sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &
[2]+  Running                 sleep 600 | sleep 600 | sleep 600 | sleep 600 | sleep 600 &

$ jobs -l
[1]-  6108 Running                 sleep 600
      6109                       | sleep 600
      6110                       | sleep 600
      6111                       | sleep 600
      6112                       | sleep 600 &
[2]+  6114 Running                 sleep 600
      6115                       | sleep 600
      6116                       | sleep 600
      6117                       | sleep 600
      6118                       | sleep 600 &

Parse the jobs list of the second job, %2. The parsing is probably error-prone, but in these cases it works. We aim to capture the first number followed by a space; the result is stored in the variable pids as an array using the parentheses:

$ pids=($(jobs -l %2 | perl -pe '/(\d+) /; $_=$1 . "\n"'))
$ echo $pids
6114
$ echo ${pids[*]}
6114 6115 6116 6117 6118
$ echo ${pids[2]}
6116
$ echo ${pids[4]}
6118

And for the first pipeline:

$ pids=($(jobs -l %1 | perl -pe '/(\d+) /; $_=$1 . "\n"'))
$ echo ${pids[2]}
6110
$ echo ${pids[4]}
6112

We could wrap this into a little alias/function:

function pipeid() { jobs -l ${1:-%%} | perl -pe '/(\d+) /; $_=$1 . "\n"'; }
$ pids=($(pipeid))     # PIDs of last job
$ pids=($(pipeid %1))  # PIDs of first job

I have tested this in bash and zsh. Unfortunately, in bash I could not pipe the output of pipeid into another command, probably because that pipeline is run in a subshell that cannot query the job list.

