Bash : Cat based on array variable

Developer https://www.devze.com 2023-04-01 16:19 Source: Internet
I want to concatenate two or more files depending on whether their names contain elements of an array.

I am reading this kind of file line by line (proteome.pisa):

2PJY_p  chain=(B C) hresname=() hresnumber=()   hatom=()    model=()    altconf=()
2Q7N_p  chain=(A E F G H I J K L)   hresname=(FUC MAN NAG)  hresnumber=()   hatom=()    model=()    altconf=()

For each line, the script extracts the string in the first column and stores it in the variable pdbid. It then takes the second column and defines it as an array (chain, with elements $c). Finally, it checks whether a file called ${pdbid}_${c}_p.pdb exists and, if it does, appends its content to the file ${pdbid}_p_${chains}.pdb

This is the script:

while read line ; do

echo "$line" > pdb.line
cut -f1 pdb.line > pdb.list
sed -i 's/.*/\"&\"/' pdb.list
sed -i 's/_p//g' pdb.list
awk '{ printf "pdbid="; print }' pdb.list > pdbid.list

cut -f2 pdb.line > chain.list

source pdbid.list
source chain.list

chains=`printf "%s" "${chain[@]}"`

for c in ${chain[@]} ; do
if [ ${#chain[@]} -gt 1 ] && \
   [ -f ${pdbid}_${c}_p.pdb ] ; then  
cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb
fi
done

done < proteome.pisa

The expected behaviour, for the first row for instance, was to merge 2PJY_p_B.pdb and 2PJY_p_C.pdb into a file called 2PJY_p_BC.pdb. However, what it actually does is merge the first file twice. I cannot understand why...


This is a great question because it shows that bash cannot do everything on its own; it needs helpers such as awk, cut, and so on. I looked through your script, and it seems that after the two source lines you expect the variables pdbid, chain, and chains to be set. However, your script does not set them correctly, and I can help with that part. I don't know Perl very well, but I think it works nicely in this case. Here is makevars.pl:

while (<STDIN>) {
    my($line) = $_;
    if ($line =~ /^(.*)_p.*chain=\((.*)\).*hresname.*$/) {
        print "pdbid=$1\n";
        print "chain=($2)\n";
        $chains = $2;
        $chains =~ s/ //g;
        print "chains=$chains\n";
    }
}

And here is the shell script:

while read line
do

    echo "$line" | perl makevars.pl >setvars.sh
    source setvars.sh
    # Now, pdbid, chain, and chains are set, do your things

done < proteome.pisa
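
As a rough sketch of what could go in place of the "do your things" comment, the merge step might be wrapped in a small function. The name merge_chains is hypothetical (it does not appear anywhere in the thread), and the file naming (${pdbid}_${c}_p.pdb as input, ${pdbid}_p_${chains}.pdb as output) is taken from the question's script.

```shell
#!/bin/bash
# merge_chains is a hypothetical helper, not part of the original answer.
# Usage: merge_chains PDBID CHAIN...
merge_chains() {
    local pdbid=$1; shift
    local chain=("$@")
    local chains c
    chains=$(printf '%s' "${chain[@]}")   # e.g. B C -> BC

    # As in the question, only merge when there is more than one chain.
    [ "${#chain[@]}" -gt 1 ] || return 0

    for c in "${chain[@]}"; do
        # Skip chains for which no file exists.
        [ -f "${pdbid}_${c}_p.pdb" ] || continue
        cat "${pdbid}_${c}_p.pdb" >> "${pdbid}_p_${chains}.pdb"
    done
}
```

Inside the loop it would be called as merge_chains "$pdbid" "${chain[@]}". Because it appends with >>, running it twice on the same input duplicates the content, just as the question's script would.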

I hope this helps.


I would suggest preprocessing the input into a simpler form with sed, then looping over that. This is assuming the chain=(...) is always the first such attribute on a line.

#!/bin/bash

# Replace 2ICQ_p chain=(A B C ... Z) attribs= ...   with
# 2ICQ A B C ... Z
# (the _p suffix is dropped so that it is not doubled in the file names)
sed 's/_p[[:space:]]*chain=(/ /;s/).*//' <proteome.pisa |
while read -r pdbid chain; do
    # Remove every space: "A B C" becomes "ABC"
    chains=${chain// /}
    for c in $chain; do
        test -e "${pdbid}_${c}_p.pdb" || continue
        cat "${pdbid}_${c}_p.pdb"
    done >"${pdbid}_p_${chains}.pdb"
done

This avoids the temporary files that riddled your first script. Sourcing a generated file also looks rather startling, if not alarming (usually you can use backticks, i.e. command substitution, for that sort of thing, but they are not really required here).

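A note on the parentheses in the sed script: in basic regular expressions, which is what sed uses by default, a bare ( or ) matches a literal parenthesis, while \( and \) delimit a group. With sed -E (extended regular expressions) the roles are reversed, so a literal parenthesis has to be backslashed there. If sed complains about an unmatched parenthesis, check which of the two forms you are using.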
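
A quick way to check how your sed treats parentheses is to run both forms on a sample line. This is only a sketch: the function names strip_bre and strip_ere are made up for the test, and it assumes a sed that supports -E (current GNU and BSD versions do).

```shell
# Hypothetical helpers, named only for this demonstration.
# Default (basic) syntax: bare parentheses are literal.
strip_bre() { sed 's/ chain=(/ /;s/).*//'; }
# Extended syntax (-E): literal parentheses are backslashed.
strip_ere() { sed -E 's/ chain=\(/ /;s/\).*//'; }

line='2PJY_p chain=(B C) hresname=()'
printf '%s\n' "$line" | strip_bre   # 2PJY_p B C
printf '%s\n' "$line" | strip_ere   # 2PJY_p B C
```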

read with multiple variable names splits the input on whitespace so that the first variable receives the first token, and so on; the last variable receives whatever is left, without further whitespace splitting. continue jumps to the next iteration of the enclosing for or while loop. Other than that, this should be fairly self-explanatory. If you are really pressed to do it all in pure Bourne shell, the sed replacement at the beginning could probably be replaced with something involving string substitutions.
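
The two points above can be sketched as follows: first how read distributes tokens, then one possible way to replace the sed step with parameter expansion. parse_line is a hypothetical helper, and it assumes every line contains exactly one _p suffix before chain=(...); lines that do not fit that shape are not handled.

```shell
# read splits on whitespace; the last variable receives the remainder.
read -r first rest <<EOF
2PJY B C
EOF
echo "first=$first rest=$rest"     # first=2PJY rest=B C

# parse_line is a hypothetical helper (not from the answer above).
# It sets pdbid, chain and chains from one line of proteome.pisa.
parse_line() {
    pdbid=${1%%_p*}                # text before the first "_p"
    chain=${1#*chain=\(}           # drop everything up to "chain=("
    chain=${chain%%\)*}            # keep what precedes the next ")"
    chains=$(printf '%s' "$chain" | tr -d ' ')   # "B C" -> "BC"
}

parse_line '2PJY_p chain=(B C) hresname=()'
echo "$pdbid | $chain | $chains"   # 2PJY | B C | BC
```

Only tr is used as an external command here; the rest is plain parameter expansion, so no temporary files or sourced scripts are involved.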


The problem appears to be how the array is indexed in this line:

cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb

Changing it to:

cat ${pdbid}_${c}_p.pdb >> ${pdbid}_p_${chains}.pdb

appears to solve the problem.

In addition, I have double-quoted all occurrences of "${chain[@]}".
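
To see why the original line picked the first file twice: inside the loop, $c holds an element (B, then C), not an index. Bash evaluates the subscript of an indexed array as an arithmetic expression, so B and C are treated as variable names; being unset, they evaluate to 0, and ${chain[$c]} returns the first element every time. A minimal sketch (the variable names indexed and direct are made up for the demonstration, and it assumes no shell variables called B or C are set):

```shell
#!/bin/bash
chain=(B C)
indexed=""
direct=""

for c in "${chain[@]}"; do
    indexed+="${chain[$c]}"   # subscript B or C evaluates to 0
    direct+="$c"              # the element itself
done

echo "via \${chain[\$c]}: $indexed"   # BB
echo "via \$c:           $direct"     # BC
```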
