
bash gnu parallel help

It's about GNU Parallel (http://en.wikipedia.org/wiki/Parallel_(software)), which has a very rich manpage: http://www.gnu.org/software/parallel/man.html

(for x in `cat list` ; do
    do_something $x
done) | process_output

is replaced by this:

 cat list | parallel do_something | process_output
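
For example (a hypothetical case, assuming a file urls.txt with one URL per line and wget installed), the same pattern could look like this:

    # fetch every URL listed in urls.txt, up to 8 downloads at a time;
    # {} is replaced by each input line
    cat urls.txt | parallel -j8 wget -q {}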

I am trying to apply that to this:

    while [ "$n" -开发者_StackOverflow社区gt 0 ]
        do          
        percentage=${"scale=2;(100-(($n / $end) * 100))"|bc -l}}
    #get url from line specified by n from file done1              
nextUrls=`sed -n "${n}p" < done1`
    echo -ne "${percentage}%  $n / $end urls saved going to line 1. current: $nextUrls\r"
#    function that gets links from the url
    getlinks $nextUrls
#save n
    echo $n > currentLine
    let "n--"
    let "end=`cat done1 |wc -l`"
    done

While reading the documentation for GNU Parallel, I found out that functions are not supported, so getlinks cannot be used with parallel.

The best I have found so far is:

seq 30 | parallel -n 4 --colsep '  ' echo {1} {2} {3} {4}

which produces this output:

1 2 3 4 
5 6 7 8 
9 10 11 12 
13 14 15 16 
17 18 19 20 
21 22 23 24 
25 26 27 28 
29 30 

If I am right, the while loop mentioned above should go something like this:

end=`cat done1 | wc -l`
seq $end -1 1 | parallel -j+4 -k
# (everything except the getlinks function goes here, but I don't know how?) |
# every time a job finishes, do
getlinks $nextUrls

Thanks in advance for any help.


It seems what you want is a progress meter. Try:

cat done1 | parallel --eta wget
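
If the bottom-up order of the original loop matters, a hedged variant of the same idea (assuming done1 holds one URL per line) is:

    # process done1 from its last line up to line 1, keeping output in input order
    tac done1 | parallel -k --eta wget -q {}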

If that is not what you want, look at sem (sem is an alias for parallel --semaphore and is normally installed with GNU Parallel):

for i in `ls *.log` ; do
  echo $i
  sem -j+0 gzip $i ";" echo done
done
sem --wait

In your case it will be something like:

while [ "$n" -gt 0 ]
    do          
    percentage=${"scale=2;(100-(($n / $end) * 100))"|bc -l}}
    #get url from line specified by n from file done1
    nextUrls=`sed -n "${n}p" < done1`
    echo -ne "${percentage}%  $n / $end urls saved going to line 1. current: $nextUrls\r"
    #    function that gets links from the url
    THE_URL=`getlinks $nextUrls`
    sem -j10 wget $THE_URL
    #save n
    echo $n > currentLine
    let "n--"
    let "end=`cat done1 |wc -l`"
done
sem --wait
echo All done


Why does getlinks need to be a function? Take the function and turn it into a shell script. It should be essentially identical, except that you need to pass environment variables in, and of course you cannot affect the outside environment without a lot of work.
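
A minimal sketch of that idea, assuming the function is saved as an executable script named getlinks.sh (the file name and the per-job temp file are my own choices, not part of the answer):

    #!/bin/bash
    # getlinks.sh - standalone version of the getlinks function
    # usage: getlinks.sh URL
    # appends any .jpg links found on URL to done1
    url="$1"
    [ -n "$url" ] || exit 0
    tmp=$(mktemp)    # per-job temp file, so parallel jobs do not clash
    lynx -image_links -dump "$url" > "$tmp"
    grep -i ".jpg" "$tmp" | grep -i "http" | sed -e 's/.*\(http\)/http/g' >> done1
    rm -f "$tmp"

It could then be called as sem -j10 ./getlinks.sh "$nextUrls" or cat done1 | parallel ./getlinks.sh {} (deduplicating done1 afterwards with sort -f done1 | uniq), since sem and parallel can exec a script file even though they cannot exec a shell function.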

Of course, you cannot save $n into currentLine while you are executing in parallel: all the jobs would be overwriting the same file at the same time.
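
One hedged alternative for tracking progress (not from the original answer) is to let GNU Parallel record finished jobs itself with --joblog, and resume an interrupted run with --resume instead of reading currentLine:

    # every finished job is logged to jobs.log; on a re-run, --resume skips
    # the URLs that were already completed
    cat done1 | parallel -j10 --joblog jobs.log --resume wget -q {}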


I was thinking of making something more like this, with something other than parallel or sem if necessary, because parallel does not support functions (see http://www.gnu.org/software/parallel/man.html#aliases_and_functions_do_not_work):

getlinks(){
    if [ -n "$1" ]
    then
        lynx -image_links -dump "$1" > src
        grep -i ".jpg" < src > links1
        grep -i "http" < links1 > links
        sed -e 's/.*\(http\)/http/g' < links >> done1
        sort -f done1 > done2
        uniq done2 > done1
        rm -rf links1 links src done2
    fi
}
func(){
    percentage=$(echo "scale=2; (100-(($1 / $end) * 100))" | bc -l)
    # get url from line specified by $1 from file done1
    nextUrls=`sed -n "${1}p" < done1`
    echo -ne "${percentage}%  $1 / $end urls saved going to line 1. current: $nextUrls\r"
    # function that gets links from the url
    getlinks $nextUrls
    # save current line
    echo $1 > currentLine
    let "end=`cat done1 | wc -l`"
}
while [ "$n" -gt 0 ]
do
    sem -j10 func $n
    let "n--"
done
sem --wait
echo All done

My script has become really complex, and I do not want to make a feature unavailable by relying on something I am not sure can do it. Done this way, I can fetch links while the full internet bandwidth is being used, which should take less time.


I tried sem:

#!/bin/bash
func (){
echo 1
echo 2
}


for i in `seq 10`
do
sem -j10 func 
done
sem --wait
echo All done

and you get these errors, because sem tries to exec func as an external command and the shell function is not visible to it:

Can't exec "func": No such file or directory at /usr/share/perl/5.10/IPC/Open3.p
m line 168.
open3: exec of func  failed at /usr/local/bin/sem line 3168  


It is not quite clear what the end goal of your script is. If you are trying to write a parallel web crawler, you might be able to use the following as a template.

#!/bin/bash

# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)

# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN

while [ -s $URLLIST ] ; do
  cat $URLLIST |
    parallel lynx -listonly -image_links -dump {} \; wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
    grep -F $BASEURL |
    grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
  mv $URLLIST2 $URLLIST
done

rm -f $URLLIST $URLLIST2 $SEEN
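
For example (assuming the template is saved as an executable file named crawler.sh, a name of my own choosing), it would be started like this:

    # crawl starting from the example URL mentioned in the script's comment
    ./crawler.sh http://gatt.org.yeslab.org/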
