For example... if I had a file like this:
A 16 chr11 36595888
A 0 chr1 155517200
B 16 chr1 43227072
C 0 chr20 55648508
D 0 chr2 52375454
D 16 chr2 73574214
D 0 chr3 93549403
E 16 chr3 3315671
I need to print only the lines which have a unique first column:
B 16 chr1 43227072
C 0 chr20 55648508
E 16 chr3 3315671
It's similar to awk '!_[$1]++', but I want to remove all lines which have a non-unique first field.
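On my sample, that one-liner would keep the first line for every key, including the duplicated A and D keys:
A 16 chr11 36595888
B 16 chr1 43227072
C 0 chr20 55648508
D 0 chr2 52375454
E 16 chr3 3315671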
Bash and Python solutions preferred.
In bash, assuming the first column has a fixed width (3):
sort input-file.txt | uniq -u -w 3
The '-u' option prints only the lines that are not repeated, and '-w 3' compares no more than the first 3 characters.
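For the sample above the key is only one character wide, so '-w 1' is the width you'd actually use (note that '-w' is a GNU uniq option):
sort input-file.txt | uniq -u -w 1
which prints:
B 16 chr1 43227072
C 0 chr20 55648508
E 16 chr3 3315671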
How about this:
#!/usr/bin/env python
from collections import defaultdict

data = defaultdict(list)
with open('file') as f:
    for line in sorted(f):                   # sorting groups the output by key
        data[line.split()[0]].append(line)   # key on the first field
for key in sorted(data):
    if len(data[key]) == 1:                  # keep keys seen exactly once
        print(data[key][0], end='')
awk '
{count[$1]++; line[$1]=$0}
END {for (val in count) if (count[val]==1) print line[val]}
' filename
That may alter the order of lines. If that's a problem, try this 2-pass approach:
awk '
NR==FNR {count[$1]++; next}
count[$1] == 1 {print}
' filename filename
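Here the first pass (NR==FNR) only counts how many times each first field occurs; the second pass prints the lines whose field occurred exactly once, so the original line order is preserved.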
A sed one-liner solution (note that it compares only the first character of each line and expects duplicate keys on adjacent lines, so sort the input first if needed):
sed ':a;$bb;N;/^\(.\).*\n\1[^\n]*$/ba;:b;s/^\(.\).*\n\1[^\n]*\n*//;ta;/./P;D' file
In Python, much easier to read and tweak:
d = dict()
for line in open('input-file.txt', 'r'):
    key = line.split(' ', 1)[0]                 # first space-separated field
    d.setdefault(key, list()).append(line.rstrip())
for k, v in sorted(d.items()):
    if len(v) == 1:                             # only keys with a single line
        print(v[0])
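Note that sorted(d.items()) emits the surviving lines in key order rather than input order; the OrderedDict answer below keeps the input order instead.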
import sys
from collections import OrderedDict

lines = OrderedDict()
for line in sys.stdin:
    field0 = line.split()[0]        # first whitespace-separated field
    lines[field0] = None if field0 in lines else line
for line in lines.values():
    if line is not None:            # None marks keys seen more than once
        sys.stdout.write(line)
If you don't care about preserving order you could use a plain old dict ({}) instead of OrderedDict; in Python 3.7+ a plain dict preserves insertion order anyway.
This implementation doesn't care whether the duplicate keys are adjacent, so the input doesn't need to be sorted.
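Since it reads from stdin, you'd run it as a filter; assuming it's saved as, say, uniq_first.py (the name is just for illustration):
python uniq_first.py < input-file.txt
B 16 chr1 43227072
C 0 chr20 55648508
E 16 chr3 3315671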