I want to concatenate two or more files depending on whether their names contain elements from an array.
I am reading this kind of file line by line (proteome.pisa):
2PJY_p chain=(B C) hresname=() hresnumber=() hatom=() model=() altconf=()
2Q7N_p chain=(A E F G H I J K L) hresname=(FUC MAN NAG) hresnumber=() hatom=() model=() altconf=()
For each line, the script extracts the string on the first column and defines it as the variable pdbid. Then it takes the second column and defines it as an array (chain of elements $c). Then it checks if a file called ${pdbid}_${c}_p.pdb exists and, if it does, it merges its content into the file ${pdbid}_p_${chains}.pdb
This is the script:
while read line ; do
echo "$line" > pdb.line
cut -f1 pdb.line > pdb.list
sed -i 's/.*/\"&\"/' pdb.list
sed -i 's/_p//g' pdb.list
awk '{ printf "pdbid="; print }' pdb.list > pdbid.list
cut -f2 pdb.line > chain.list
source pdbid.list
source chain.list
chains=`printf "%s" "${chain[@]}"`
for c in ${chain[@]} ; do
if [ ${#chain[@]} -gt 1 ] && \
[ -f ${pdbid}_${c}_p.pdb ] ; then
cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb
fi
done
done < proteome.pisa
The expected behaviour was to merge, for instance, for the first row, 2PJY_B_p.pdb and 2PJY_C_p.pdb into a file called 2PJY_p_BC.pdb. However, what it actually does is merge the first file twice. I cannot understand why...
This is a great question, because it demonstrates that bash cannot do everything on its own and needs helpers such as awk, cut, and so on. Looking through your script, it seems that after the two source lines you expect the variables pdbid, chain, and chains to be set. However, your script does not set them correctly, and I can help with that part. I don't know Perl very well, but I think it will work nicely in this case. Here is makevars.pl:
while (<STDIN>) {
my($line) = $_;
if ($line =~ /^(.*)_p.*chain=\((.*)\).*hresname.*$/) {
print "pdbid=$1\n";
print "chain=($2)\n";
$chains = $2;
$chains =~ s/ //g;
print "chains=$chains\n";
}
}
And here is the shell script:
while read line
do
echo "$line" | perl makevars.pl >setvars.sh
source setvars.sh
# Now, pdbid, chain, and chains are set, do your things
done < proteome.pisa
I hope this helps.
I would suggest preprocessing the input into a simpler form with sed, then looping over that. This is assuming the chain=(...) is always the first such attribute on a line.
#!/bin/bash
# Replace 2ICQ_p chain=(A B C ... Z) attribs= ... with
# 2ICQ A B C ... Z
# (basic regex: an unescaped parenthesis is literal)
sed 's/_p[[:space:]]*chain=(/ /;s/).*//' <proteome.pisa |
while read -r pdbid chain; do
  chains=${chain// /}   # bash: remove every space, "B C" -> "BC"
  for c in $chain; do
    test -e "${pdbid}_${c}_p.pdb" || continue
    cat "${pdbid}_${c}_p.pdb"
  done >"${pdbid}_p_${chains}.pdb"
done
This avoids the use of temporary files which riddled your first script; sourcing a generated file also looks rather startling, if not alarming (usually you can use backticks for that sort of thing, but they are not really required here).
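A note on parentheses in sed: by default sed uses POSIX basic regular expressions, in which a plain ( is a literal parenthesis and \( opens a group; with extended regular expressions (sed -E) it is the other way round. If sed complains about an unmatched parenthesis, check that the pattern matches the flavour you are invoking.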
read with multiple variable names splits the input on whitespace so that the first variable name receives the first token, etc; the last named variable receives whatever is left, without additional whitespace splitting. continue jumps to the next iteration of the enclosing for or while loop. Other than that, this should be fairly self-explanatory. If you are really pressed to do it all in pure shell, the sed replacement at the beginning could probably be replaced with something involving string substitutions.
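As a rough sketch of the string-substitution idea, assuming bash (not a strict Bourne shell) and the line layout shown in the question, the parsing could look something like this. The function name parse_pisa_line is made up for illustration and is not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper: split one proteome.pisa line into the PDB ID and its
# chain letters using only read and parameter expansion (no sed, no temp files).
parse_pisa_line() {
    local line=$1 id rest
    # id receives the first token, rest receives everything that is left
    read -r id rest <<< "$line"
    id=${id%_p}                  # strip the trailing _p: 2PJY_p -> 2PJY
    rest=${rest#*"chain=("}      # drop everything up to and including "chain=("
    rest=${rest%%")"*}           # drop everything from the first ")" onwards
    printf '%s %s\n' "$id" "$rest"
}

# prints: 2PJY B C
parse_pisa_line '2PJY_p chain=(B C) hresname=() hresnumber=() hatom=() model=() altconf=()'
```

The output has the same "ID followed by chain letters" shape that the while read loop expects, so it could feed that loop in place of the sed command. Like the sed version, it assumes chain=(...) is the first parenthesised attribute on the line.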
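The problem appears to be the way the array is indexed in this line. The subscript of an indexed array is evaluated arithmetically, so the chain letter in $c is treated as the name of an unset variable, evaluates to 0, and always selects the first element: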
cat ${pdbid}_${chain[$c]}_p.pdb >> ${pdbid}_p_${chains}.pdb
Changing it to:
cat ${pdbid}_${c}_p.pdb >> ${pdbid}_p_${chains}.pdb
appears to solve the problem.
In addition, I have double-quoted all occurrences of "${chain[@]}".
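To see why the original line picked the first file twice, here is a small demonstration, assuming bash and a chain array like the one sourced from the first input row. The array names by_subscript and by_value are only for illustration.

```bash
#!/bin/bash
chain=(B C)
by_subscript=()
by_value=()

for c in "${chain[@]}"; do
    # The subscript is evaluated arithmetically: B and C are unset variable
    # names here, so both evaluate to 0 and select chain[0].
    by_subscript+=("${chain[$c]}")
    # Using the loop variable directly gives the element itself.
    by_value+=("$c")
done

echo "via chain[\$c]: ${by_subscript[*]}"   # B B
echo "via \$c:        ${by_value[*]}"        # B C
```

This relies on no variables named B or C being set in the shell; if they were, their values would be used as the index instead.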