Would anyone know how to select up to N (random, or the first N) rows for each unique value of a column, using Unix commands (or sed, awk, etc.)? Please no SQL, as I don't know that language.
Thank you very much for your help! Carole
here is an example input file:
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
I would like to extract up to N rows (let's say up to 2 for this example) for each unique value in column 2. Expected output:
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
Here I chose the first two rows, but it could be a randomly chosen pair of rows for each unique value in column 2.
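For reference, a single awk command can do this without any temporary files; this is just a sketch, assuming whitespace-separated fields and that the data is in test.txt:

$ awk -v n=2 'seen[$2]++ < n' test.txt

Here seen[$2] counts how many rows have already been printed for each column-2 value, and a row is printed only while that count is still below n. For a random choice rather than the first N, you could shuffle the input first (GNU coreutils shuf), e.g. shuf test.txt | awk -v n=2 'seen[$2]++ < n'.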
This will return a constant number of rows (two in this case) for each unique value in column 2, which I'm pretty sure isn't quite what you expected. It assumes your input data is in the file 'test.txt'.
$ sort -k2 -u test.txt > a.tmp; sort a.tmp a.tmp
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
6 00108.padded.fasta 2348
It's not clear what you expect if your input has only one row for a given unique value in column 2. If you still want two rows in the output, then this will work.
#!/bin/bash
# tested with bash 4 (associative arrays need bash 4+)
declare -A assoc
declare -a count
# Read each line, split it into fields, and append the whole record
# (terminated with "|") to the entry keyed by its column-2 value.
while read -r line
do
    array=($line)
    assoc[${array[1]}]+="${array[*]}|"
done < file
# Split each stored string on "|" and print the first two records.
OIFS=$IFS
IFS="|"
for i in "${!assoc[@]}"
do
    count=(${assoc[$i]})
    echo "${count[@]:0:2}"
done
IFS="$OIFS"
@carol, this is my output using your sample data. I'm using bash 4+; if you don't have it, the associative arrays won't work.
bash4> bash N.sh
6 00108.padded.fasta 2348 6 00108.padded.fasta 2348
3 00017.padded.fasta 1769 3 00017.padded.fasta 1769
5 00059.padded.fasta 2986 5 00059.padded.fasta 2986
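Note that the two records for each key come out joined on one line. If you want one record per line instead, a small variation on the same loop (a sketch, keeping the IFS="|" splitting from the script above) would be:

for i in "${!assoc[@]}"
do
    count=(${assoc[$i]})
    for rec in "${count[@]:0:2}"
    do
        echo "$rec"
    done
done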
Here is a small script for your purpose:
#!/usr/bin/ksh
# Split the input into one temp file per unique column-2 value.
awk '{ print $0 > ($2 ".yourfile") }' yourfile
# Print the first two lines of each temp file.
for i in *.yourfile
do
    awk 'NR<=2' "$i"
done
# Clean up the temp files.
rm -f *.yourfile
if [ $? -eq 0 ]
then
    echo "remove temp files successful"
fi
Here is the execution:
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay> script.sh
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
remove temp files successful
torinoco!DBL:/oo_dgfqausr/test/dfqwrk12/vijay>
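One caveat: this writes *.yourfile files into the current directory, so it could clobber anything already named that way. A safer sketch (assuming mktemp -d is available) keeps the temp files in a throwaway directory:

#!/usr/bin/ksh
tmpdir=$(mktemp -d) || exit 1
# One temp file per unique column-2 value, inside the scratch directory.
awk -v d="$tmpdir" '{ print $0 > (d "/" $2 ".part") }' yourfile
for i in "$tmpdir"/*.part
do
    awk 'NR<=2' "$i"
done
rm -rf "$tmpdir" && echo "remove temp files successful"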
This is my file test.awk.
# Input comes from "sort | uniq -c", so $1 is the duplicate count.
# Cap the count at n; assigning to $1 also rebuilds $0, which squeezes
# out the leading blanks that uniq -c adds (this matters for the later
# cut on a single-space delimiter).
$1 >= n { $1 = n; }
$1 <  n { $1 = $1; }
# Print the record once per (capped) count.
{
    for (i = 1; i <= $1; i++) {
        print $0;
    }
}
Here's my version of test.txt. Assumes your test data is representative.
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
1 00001.padded.fasta 1000
And this is the command line that gives you up to 'n' lines of output.
$ sort test.txt | uniq -c | awk -v n=2 -f test.awk | cut -f 1 -d " " --complement
1 00001.padded.fasta 1000
3 00017.padded.fasta 1769
3 00017.padded.fasta 1769
5 00059.padded.fasta 2986
5 00059.padded.fasta 2986
6 00108.padded.fasta 2348
To change the number of lines, change the value assigned to 'n'. n=4, n=3, etc.
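To see why the pipeline works, look at the intermediate step: sort groups the duplicate lines together, and uniq -c prefixes each distinct line with its count (the exact column padding varies by implementation), roughly:

$ sort test.txt | uniq -c
      1 1 00001.padded.fasta 1000
      4 3 00017.padded.fasta 1769
      6 5 00059.padded.fasta 2986
      1 6 00108.padded.fasta 2348

The awk script then caps that leading count at n and prints each line that many times, and the final cut strips the count field back off.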