开发者

split command question

开发者 https://www.devze.com 2023-03-27 06:31 出处:网络
I got problem with using split command. The input string is as follows: 080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:202783chr1042038185255 36M =42037995-225GCCAGGTTTAATAAATTATTTATAGAATACTGCATC@?DDEAEFD

I got problem with using split command. The input string is as follows:

080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36

I want to grab '2027' from this string my command is: line.split(':',4)[1].split()[0] However, it doesn't work. The output is '1'

Then I switch to line.split(':',4) And output is still '1', and I see the first-step split is already problematic.

However, when I try line.split(':',1), I got expected result as:

1:8:1649:2027   83  chr10   42038185    255 36M =   42037995-225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?开发者_开发技巧CBA>A    NM:i:0  MD:Z:36

I'm confused by this split command! (I asked the similar question before, and split command worked at that time) thanks


It appears that what you want is

line.split(':',4)[4].split()[0]

The numeric parameter to split indicates the maximum number of splits that will occur. So you have:

>>> line='080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027 ...'
>>> line.split(':',4)
['080821_HWI-EAS301_0002_30ALBAAXX', '1', '8', '1649', '2027 ...']

If you pull element [1] out of this return value, you get '1'. I don't see why you are surprised by this.

Since you are allowing up to 4 splits, and the item you want will be the last one, the subscript you want is [4]:

>>> line.split(':',4)[4]
'2027 ...'

Then you can split that on space and get element [0] from it to produce your result.

You get the same result if you don't pass a split limit value at all:

>>> line.split(':')[4].split()[0]
'2027'


Try this:

#!/usr/bin/python

line = '080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10   42038185    255 36M =   42037995    -225    GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'

print line.split(':')[4].split()[0]

I'm not sure why you're trying to access the token containing 2027 like this:

line.split(':',4)

rather than this:

line.split(':')[4]

I think that you might be confused about how split works. The last parameter to the Python split function is the maximum number of splits to perform.


The second argument to split is the maximum number of splits to exercise, so you probably don't want to be using it in this case. To access the 5th element after performing the split, do this:

line.split(":")[4]

Anyway, what you probably want is to first split by whitespace (you can do this by using no arguments), and then split by colons. This can be done on one line like this:

line.split()[0].split(":")[4]


You can use instead:

s.split()[0].split(':')[4]


Split on the white space first. Then split the first element in the resultant list based on the separator (here: ':').

line.split()[0].split(':')[4]


Do you must use split?

I ask this because I've found regex to be a much better tool to use when I just need to grab a specific substring. It's not the easiest thing to learn and does appear very unapproachable at first, but you have to pay the price of learning it only once and it is an investment worth making. :)

Python homepage has a good introduction of it.

P.S. 2027 will be matched by the following regex .*?:([0-9]+)\s+


I presume that you will do numerous extractions of information from strings in the future. Then, my advice is to learn to use the regex tool, it will be inevitable.

Or you'll have to learn and use specialized library to do treatments of string in the field of genomics.

Simple solution to your present problem with module re :

line = '''080821_HWI-EAS301_0002_30ALBAAXX:1:8:1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print re.search(':(\d+) ',line).group(1)

If there are blanks before the fourth ':' the regex's pattern will be:

line = '''080821_HWI-EAS301_0002_30AL BAAXX:1:8     :1649:2027  83  chr10
42038185    255 36M =   42037995    -225
GCCAGGTTTAATAAATTATTTATAGAATACTGCATC    @?DDEAEFDAD@FBG@CDA?
DBCDEECD@D?CBA>A    NM:i:0  MD:Z:36'''

import re

print re.search('(:[^:]+){3}:(\d+)',line).group(2)

(:[^:]+) matches a ':' followed by as many characters different from ':' that may follow

{3} says that this match must be performed 3 times

then the fourth ':' must be encountered, followed by the searched number matched by \d+ ; there is no more need to indicate that there must be a blank after the number, because \d+ will stop to match in the string as soon as a non-digit character will be encountered

Parentheseses define groups. Here the desired number is catched by the second group

0

精彩评论

暂无评论...
验证码 换一张
取 消