Would anyone know how to select up to N (random, or the first N) rows for each unique value of a column, using Unix commands (or sed, awk, etc.)? Please no SQL, as I don't know that language.
Thank you very much for your help! Carole
here is an example input file:
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
I would like to extract up to N rows (let's say up to 2 for this example) for each unique value in column 2. Expected output:
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
Here I chose the first two rows, but it could be a randomly chosen pair of rows for each unique value in column 2.
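For reference, a single awk command can do this without any temporary files; this is just a sketch, assuming whitespace-separated fields and that the data is in test.txt:

$ awk -v n=2 'seen[$2]++ < n' test.txt

Here seen[$2] counts how many rows have already been printed for each column-2 value, and a row is printed only while that count is still below n. For a random choice rather than the first N, you could shuffle the input first (GNU coreutils shuf), e.g. shuf test.txt | awk -v n=2 'seen[$2]++ < n'.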
This will return a constant number of rows (two in this case) for each unique value in column 2, which I'm pretty sure isn't quite what you expected. It assumes your input data is in the file 'test.txt'.
$ sort -k2 -u test.txt > a.tmp; sort a.tmp a.tmp
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
6 00108.padded.fasta 2348
It's not clear what you expect if your input has only one row for a given unique value in column 2. If you still want two rows in the output, then this will work.
#!/bin/bash
# tested with bash 4 (associative arrays need bash 4+)
declare -A assoc
declare -a count
# Read each line, split it into fields, and append the whole record
# (terminated with "|") to the entry keyed by its column-2 value.
while read -r line
do
    array=($line)
    assoc[${array[1]}]+="${array[*]}|"
done < file
# Split each stored string on "|" and print the first two records.
OIFS=$IFS
IFS="|"
for i in "${!assoc[@]}"
do
    count=(${assoc[$i]})
    echo "${count[@]:0:2}"
done
IFS="$OIFS"
@carol, this is my output using your sample data. I'm using bash 4+; if you don't have it, the associative arrays won't work.
bash4> bash N.sh
6 00108.padded.fasta 2348 6 00108.padded.fasta 2348
3 00017.padded.fasta 1769 3 00017.padded.fasta 1769
5 00059.padded.fasta 2986 5 00059.padded.fasta 2986
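Note that the two records for each key come out joined on one line. If you want one record per line instead, a small variation on the same loop (a sketch, keeping the IFS="|" splitting from the script above) would be:

for i in "${!assoc[@]}"
do
    count=(${assoc[$i]})
    for rec in "${count[@]:0:2}"
    do
        echo "$rec"
    done
done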
Here is a small script for your purpose:
#!/usr/bin/ksh
# Split the input into one temp file per unique column-2 value.
awk '{ print $0 > ($2 ".yourfile") }' yourfile
# Print the first two lines of each temp file.
for i in *.yourfile
do
    awk 'NR<=2' "$i"
done
# Clean up the temp files.
rm -f *.yourfile
if [ $? -eq 0 ]
then
    echo "remove temp files successful"
fi
Here is the execution:
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay> script.sh
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
remove temp files successful
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay>
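One caveat: this writes *.yourfile files into the current directory, so it could clobber anything already named that way. A safer sketch (assuming mktemp -d is available) keeps the temp files in a throwaway directory:

#!/usr/bin/ksh
tmpdir=$(mktemp -d) || exit 1
# One temp file per unique column-2 value, inside the scratch directory.
awk -v d="$tmpdir" '{ print $0 > (d "/" $2 ".part") }' yourfile
for i in "$tmpdir"/*.part
do
    awk 'NR<=2' "$i"
done
rm -rf "$tmpdir" && echo "remove temp files successful"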
This is my file test.awk.
# Input comes from "sort | uniq -c", so $1 is the duplicate count.
# Cap the count at n; assigning to $1 also rebuilds $0, which squeezes
# out the leading blanks that uniq -c adds (this matters for the later
# cut on a single-space delimiter).
$1 >= n { $1 = n; }
$1 <  n { $1 = $1; }
# Print the record once per (capped) count.
{
    for (i = 1; i <= $1; i++) {
        print $0;
    }
}
Here's my version of test.txt. Assumes your test data is representative.
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
1 00001.padded.fasta 1000
And this is the command line that gives you up to 'n' lines of output.
$ sort test.txt | uniq -c | awk -v n=2 -f test.awk | cut -f 1 -d " " --complement
1 00001.padded.fasta 1000
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
To change the number of lines, change the value assigned to 'n'. n=4, n=3, etc.
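To see why the pipeline works, look at the intermediate step: sort groups the duplicate lines together, and uniq -c prefixes each distinct line with its count (the exact column padding varies by implementation), roughly:

$ sort test.txt | uniq -c
      1 1 00001.padded.fasta 1000
      4 3 00017.padded.fasta 1769
      6 5 00059.padded.fasta 2986
      1 6 00108.padded.fasta 2348

The awk script then caps that leading count at n and prints each line that many times, and the final cut strips the count field back off.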