How do I remove part of a line in a multi-line chunk using sed or Perl?

开发者 https://www.devze.com 2023-02-07 12:40 Source: Internet
I have some data that looks like this. It comes in chunks of four lines. Each chunk starts with an @ character.

@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
888888888888888888888888888

On the third line of each chunk, I want to remove the text that comes after the + character, resulting in:

@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888

Is there a compact way to do that in sed or Perl?


Assuming you don't want to blindly remove the rest of every line that starts with a +, you can do this:

sed '/^@/{N;N;s/\n+.*/\n+/}' infile

Output

$ sed '/^@/{N;N;s/\n+.*/\n+/}' infile
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888
+Dont remove me

Note: although the command above keys on the @ to decide which + line to alter, the substitution will also hit the second line if that line happens to start with a +. It doesn't sound like this is the case, but if you want to exclude this corner case as well, the following minor alteration protects against it. The greedy \(.*\) pushes the match forward to the last line in the pattern space that starts with a +, which is the third line:

sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' infile

Output

$ sed '/^@/{N;N;s/\(.*\)\n+.*/\1\n+/}' ./infile
@SRR037212.1 FC30L5TAA_102708:7:1:741:1355 length=27
+AAAAAAAAAAAAAAAAAAAAAAAAAAA
+
::::::::::::::::::::::::;;8
@SRR037212.2 FC30L5TAA_102708:7:1:1045:1765 length=27
TATAACCAGAAAGTTACAAGTAAACAC
+
888888888888888888888888888
+Dont remove me
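
The \n in the replacement text is understood by GNU sed, but some other sed implementations (BSD sed, for instance) do not treat it as a newline there. Below is a sketch of a variant that keeps the newline inside the capture group, so the replacement needs no \n. It also puts a ; before the closing brace, which stricter seds require. The function name strip_plus_portable and the sample input are made up for illustration:

```shell
# Hypothetical helper: same idea as the command above, but the newline is
# captured in \1 rather than written as \n in the replacement.
strip_plus_portable() {
  sed '/^@/{N;N;s/\(.*\n\)+.*/\1+/;}' "$@"
}

# The second line deliberately starts with + to exercise the corner case.
printf '%s\n' '@read1 length=4' '+ACG' '+read1 length=4' ';;;8' | strip_plus_portable
# prints:
# @read1 length=4
# +ACG
# +
# ;;;8
```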


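If there is never a + on the first or second lines and always one on the third line (-0100 sets Perl's input record separator to the character with octal code 100, which is @, so the file is read one chunk at a time):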

perl -0100pi -e's/\+.*/+/' datafile

Otherwise:

perl -0100pi -e's/^((?:.*\n){2}.*?\+).*/$1/' datafile

Or, on Perl 5.10+ (which supports \K):

perl -0100pi -e's/^(?:.*\n){2}.*?\+\K.*//' datafile

All of those assume @ only appears at the start of a chunk. If it may appear elsewhere, use the line number instead:

perl -pi -e's/\+.*/+/ if $. % 4 == 3' datafile


If you can use awk, you can do:

 gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"}}' INPUTFILE

So if gawk sees an @ at the start of a line, it prints that line, then reads and prints the next line, and finally reads the third line (counting from the @) and prints only a +.

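If the + is not at the start of the line, you can use gensub(/\+.*/, "+", 1) instead of the "+" in the last print (gensub takes the regexp, the replacement, and which match to replace, and operates on $0 by default).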

(And if you have perl installed, there will most likely be an a2p executable as well, which can convert the above awk script to Perl if you want.)

HTH

UPDATE (on missing 4th line):

 gawk '{if ($0 ~ /^@/ ) { print ; getline ; print ; getline ; print "+"; getline; print }}' INPUTFILE

This should print the 4th line as well.
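
One thing to watch with a bare getline: if the file ends in the middle of a chunk, getline fails and leaves $0 unchanged, yet the remaining print statements still run, emitting a + and a repeated line that were never in the input. The sketch below checks the return value of getline and uses only POSIX awk features, so it should not need gawk. The function name fix_plus_awk and the sample input are illustrative:

```shell
# Hypothetical helper: like the script above, but each getline is guarded
# so a truncated final chunk does not produce extra output.
fix_plus_awk() {
  awk '/^@/ {
    print                          # header line
    if ((getline) > 0) print       # sequence line
    if ((getline) > 0) print "+"   # third line, reduced to a bare +
    if ((getline) > 0) print       # quality line, passed through untouched
  }' "$@"
}

# The 4th line starts with @ and the second chunk is cut short on purpose.
printf '%s\n' '@read1 length=4' 'ACGT' '+read1 length=4' '@;;8' \
  '@read2 length=4' 'TT' | fix_plus_awk
# prints:
# @read1 length=4
# ACGT
# +
# @;;8
# @read2 length=4
# TT
```

Because the quality line is consumed by the last getline, it is never tested against /^@/, so a quality string that starts with @ is not mistaken for a new chunk.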


Maybe just sed '/^@/+2 s/+.*/+/'

Edit: this will not work, because sed does not accept an offset after a single address. The same idea should work as a vim command, though:

vim file -c ':g/^@/+2s/+.*/+/' -c 'wq'


This might work for you:

sed '/^@/{$!N;$!N;$!N;s/\n+[^\n]*/\n+/g}' file

or with GNU sed:

sed '/^@/,+3s/^+.*/+/' file
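
Note that both commands look for a leading + anywhere in the four-line chunk, so a quality line that itself starts with + would be reduced to a bare + as well. If every record is exactly four lines, GNU sed's first~step address can target only lines 3, 7, 11, and so on. This is a sketch under that assumption (GNU sed only; strip_plus_every_fourth is an illustrative name):

```shell
# Hypothetical helper: 3~4 selects line 3 and every 4th line after it
# (a GNU sed extension), so only the third line of each record is edited.
strip_plus_every_fourth() {
  sed '3~4s/^+.*/+/' "$@"
}

# The quality line starts with + on purpose.
printf '%s\n' '@read1 length=4' 'ACGT' '+read1 length=4' '+;;8' | strip_plus_every_fourth
# prints:
# @read1 length=4
# ACGT
# +
# +;;8

# For comparison, the range-based command also rewrites the quality line:
printf '%s\n' '@read1 length=4' 'ACGT' '+read1 length=4' '+;;8' | sed '/^@/,+3s/^+.*/+/'
# prints:
# @read1 length=4
# ACGT
# +
# +
```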