I have close to a million files over which I want to run a shell script and append the results to a single file. For example, suppose I just want to run wc on the files.
So that it runs fast, I can parallelize it with xargs. But I do not want the scripts to step on each other when writing the output. It is probably better to write to a few separate files rather than one, and then cat them together later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is it always ensured by default?
Is there any utility that will recursively cat two files in parallel?
I can write a script to do that, but then I have to deal with the temporaries and clean-up. So I was wondering whether there is a utility that already does this.
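For concreteness, here is a minimal sketch of the manual approach described above. It assumes GNU xargs (for -P), find -print0, and mktemp; the /data path, chunk size of 1000, and job count of 8 are placeholders:

    # Each parallel wc writes to its own mktemp file, so outputs never interleave.
    # Note: wc prints a "total" line per chunk when given multiple files.
    mkdir -p /tmp/wc_out
    find /data -type f -print0 \
      | xargs -0 -n 1000 -P 8 sh -c 'wc "$@" > "$(mktemp /tmp/wc_out/part.XXXXXX)"' sh
    cat /tmp/wc_out/part.* > results.txt
    rm -r /tmp/wc_out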
GNU parallel claims that it:
makes sure output from the commands is the same output as you would get had you run the commands sequentially
If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data. Use the -k option to maintain the order of the output.
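As a rough illustration (untested here; the /data path is a placeholder, and it assumes the file list comes from find), a single redirect should then be enough, since GNU parallel buffers each job's output and prints it only when that job finishes:

    # -0 reads NUL-delimited names, -k keeps output in input order.
    find /data -type f -print0 \
      | parallel -0 -k wc {} > results.txt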
Update: (non-Perl solution)
Another alternative is prll, which is implemented as shell functions with some C extensions. It is less feature-rich than GNU parallel but should do the job for basic use cases.
The feature listing claims:
Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.
so it should meet your needs, as long as the order of the output is not important.
However, note the following statement on its page:
prll generates a lot of status information on STDERR which makes it harder to use the STDERR output of the job directly as input for another program.
Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.