I have a tab-delimited text file with 8 columns:
Erythropoietin Receptor Integrin Beta 4 11.7 9.7 164 195 19 3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8 2.6 97 107 15 3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0 3.6 171 479 14 3.2
Erythropoietin Receptor Immunoglobulin 9 10.4 3.1 100 108 24 3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7 2.7 93 105 18 3.3
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 5 11.4 3.2 114 114 25 1.7
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 14 11.1 2.1 99 100 28 1.8
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 1B 10.9 4.9 133 162 29 1.9
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 11A 11.5 5.1 130 166 25 1.9
The first and second columns contain protein names, and the 8th column contains the "distance" score between each protein pair. I would like to remove the lines containing duplicate protein pairs and keep only the line with the lowest distance (the lowest value in the 8th column). In other words, for the pair Protein A-Protein B, I would like to remove all occurrences except the one with the lowest distance score. A pair counts as a duplicate even if the protein names are swapped between the two columns, so Protein A Protein B is the same as Protein B Protein A.
Something like this (untested):
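awk -F'\t' 'END {
for (r in rec) print rec[r]
}
{
# Reuse the key of an already-seen pair, even if the names are swapped
k = (($2, $1) in min) ? $2 SUBSEP $1 : $1 SUBSEP $2
# Store the line if the pair is new or this distance is lower
if (!(k in min) || $NF + 0 < min[k]) {
min[k] = $NF + 0
rec[k] = $0
}
}' infile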
I hope this would be the final update ^_^
kent$ awk -F'\t' '{if(($1,$2) in a){
if($8+0<a[$1,$2]){
a[$1,$2]=$8+0;r[$1,$2]=$0;
}
}else if (($2,$1) in a){
if($8+0<a[$2,$1]){
a[$2,$1]=$8+0;r[$2,$1]=$0;
}
}else{
a[$1,$2]=$8+0; r[$1,$2]=$0;
}
} END{for(x in r)print r[x]}' yourFile
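A possible variation on the same idea, as a rough sketch only: instead of looking up both orderings of the pair, build one canonical key by always putting the alphabetically smaller name first. The function name dedupe_pairs and the array names best and line are made up for this example, and it assumes the distance is always in column 8 and the file really is tab-delimited.

```shell
# dedupe_pairs FILE: hypothetical helper name, used only for illustration.
dedupe_pairs() {
  awk -F'\t' '
    {
      # Canonical key: the alphabetically smaller name always comes first,
      # so "A<TAB>B" and "B<TAB>A" map to the same array entry.
      # Appending "" forces a string comparison of the two names.
      key = (($1 "") < ($2 "")) ? $1 SUBSEP $2 : $2 SUBSEP $1
      d = $8 + 0                       # force a numeric distance
      # Keep the line if the pair is new or this distance is lower.
      if (!(key in best) || d < best[key]) {
        best[key] = d
        line[key] = $0
      }
    }
    END { for (key in line) print line[key] }
  ' "$1"
}
```

It would be called as dedupe_pairs infile. The order in which awk walks the array in the END block is unspecified, so pipe the result through sort if the output order matters.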