How does data get processed across pipes?

I used this command-line program, which I found in another post on SO describing how to spider a website:

wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out

When I crawl a large site, it takes a long time to finish, and meanwhile the wget.out file on disk stays at zero size. So when does the piped data get processed and written to the file on disk? Is it only after each stage in the pipeline has run to completion? In that case, will wget.out only fill up after the entire crawl is over?

How do I make the pipeline write to disk intermittently, so that even if the crawling stage is interrupted, I have some output saved?


There is buffering in each pipe, and in the stdio layer of each program: output that goes to a pipe or to a regular file is typically block-buffered rather than line-buffered. Data will not make it to the disk until the final grep has produced enough output to fill its stdio buffer and flush it to wget.out.
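
As an aside on the buffering itself: if your grep is GNU grep and your awk has fflush() (gawk and mawk both do), you can ask each stage to flush per line, and wget.out then grows while the crawl is still running. A rough sketch of the same pipeline with line buffering forced:

    $ wget --spider --force-html -r -l2 http://example.com 2>&1 |
        grep --line-buffered '^--' |
        awk '{ print $3; fflush() }' |
        grep --line-buffered -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out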

If you run your pipeline on the command line and then hit Ctrl-C, SIGINT is sent to every process in the foreground process group, terminating each one and losing any pending (still-buffered) output.
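
You can see why with a quick check: every stage of a foreground pipeline is placed in the same process group, and the terminal delivers the Ctrl-C SIGINT to that whole group. For example (assuming a ps that accepts -o pid,pgid,comm, as the Linux and BSD ones do), the sleep, cat and ps started below all report the same PGID:

    $ sleep 30 | cat | ps -o pid,pgid,comm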

Either:

  1. Ignore SIGINT in all processes but the first. Bash hackery follows (a complete version of the pipeline is sketched after this list):

    $ wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' |
        { trap '' int; awk '{ print $3 }'; } |
        ∶
    
  2. Deliver the keyboard interrupt only to the first process. Run the pipeline in the background; you can then discover its PIDs interactively with jobs -l and kill the first one (the wget).

    $ jobs -l
    [1]+ 10864 Running    wget
          3364 Running    | grep
         13500 Running    | awk
    ∶
    $ kill -int 10864
    
  3. Play around with the disown bash builtin.
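
For completeness, here is option 1 written out against the original pipeline (a sketch, assuming bash): every stage after wget ignores SIGINT, so Ctrl-C stops only the crawl, while the downstream stages drain whatever is already in the pipes and flush it to wget.out on exit.

    $ wget --spider --force-html -r -l2 http://example.com 2>&1 |
        { trap '' INT; grep '^--'; } |
        { trap '' INT; awk '{ print $3 }'; } |
        { trap '' INT; grep -v '\.\(css\|js\|png\|gif\|jpg\)$'; } > wget.out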
