I have some data that looks like this. It comes in chunks of four lines, and each chunk starts with an @ character.
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
888888888888888888888888888
On the third line of each chunk, I want to remove the text that comes after the + character, resulting in:
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888
Is there a compact way to do that in sed or Perl?
Assuming you don't want to blindly strip the rest of every line that starts with a +, you can do this:
sed '/^@/{N;N;s/\n+.*/\n+/}' infile
Output
$ sed '/^@/{N;N;s/\n+.*/\n+/}' infile
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888
+Dont remove me
Note: Although the above command keys on the @ to decide whether a line starting with a + should be altered, it will still alter the 2nd line of a chunk if that line also happens to start with a +. That doesn't sound like your case, but if you want to exclude this corner case as well, the following minor alteration protects against it (the greedy \(.*\) swallows everything up to the last newline followed by a + in the three-line pattern space):
sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' infile
Output
$ sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' ./infile
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
+AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888
+Dont remove me
If there is never a + on the first or second lines and always one on the third line:
perl -0100pi -e's/\+.*/+/' datafile
Otherwise:
perl -0100pi -e's/^((?:.*\n){2}.*?\+).*/$1/' datafile
or on 5.10+:
perl -0100pi -e's/^(?:.*\n){2}.*?\+\K.*//' datafile
All of those assume that @ appears only at the start of a chunk. If it may appear elsewhere, then key on the line number instead:
perl -pi -e's/\+.*/+/ if $. % 4 == 3' datafile
If you can use awk, you can do:
gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"}}' INPUTFILE
So if gawk sees an @ at the start of a line, it prints that line, then reads and prints the next line, and finally reads the 3rd line (counting from the @) and prints only the +.
If the + is not at the start of the line, you can use gensub(/\+.*/, "+", 1) instead of the "+" in the last print. (The third argument of gensub says which match to replace; the target defaults to $0.)
(And if you have perl installed, there may well be an a2p executable too, which can convert the above awk script to Perl, if you want to...)
HTH
UPDATE (on missing 4th line):
gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"; getline; print }}' INPUTFILE
This should print the 4th line as well.
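If the file is strictly four lines per chunk, a line-number-based variant avoids the chain of getline calls. This is only a sketch under that assumption; it uses plain awk features, and the helper name strip_plus is invented for the example.

```shell
# Replace everything from the first "+" onward on lines 3, 7, 11, ...
# and print every line (changed or not).
strip_plus() {
  awk 'NR % 4 == 3 { sub(/[+].*/, "+") } { print }' "$@"
}

# Try it on a tiny made-up sample read from stdin.
printf '@r1\nAC\n+r1\n;;\n@r2\nGT\n+r2\n88\n' | strip_plus
```

Because it never looks for @, it is not confused by an @ or + inside the other lines, but it will silently misbehave if a chunk is ever shorter or longer than four lines.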
Maybe just sed '/^@/+2 s/+.*/+/'
Edit: this will not work, because sed does not accept an offset such as +2 after a pattern address. As a vim command the same idea should work:
vim file -c ':g/^@/+2s/+.*/+/' -c 'wq'
This might work for you:
sed '/^@/{$!N;$!N;$!N;s/\n+[^\n]*/\n+/g}' file
or with GNU sed:
sed '/^@/,+3s/^+.*/+/' file
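One caveat with the range form is that it also covers the fourth line of each chunk, so a fourth line that itself starts with a + would be cut down as well. A possible alternative, sketched here on the assumption that GNU sed is available and that chunks are always exactly four lines, is the first~step address (a GNU extension). The function name strip_third is made up for the example.

```shell
# GNU sed only: 3~4 selects lines 3, 7, 11, ... regardless of their content.
strip_third() {
  sed '3~4s/^+.*/+/' "$@"
}

# Made-up sample whose 4th line also starts with "+"; it should be left alone.
printf '@r1\nAC\n+r1\n+;8\n' | strip_third
```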