开发者

Trypsin digest (cleavage) does not work using regular expression

开发者 https://www.devze.com 2023-03-09 15:24 出处:网络
I have trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P. (i.e. the trypsin cleaves (cuts) the protein se

I have trying to code the theoretical tryptic cleavage of protein sequences in Python. The cleavage rule for trypsin is: after R or K, but not before P. (i.e. the trypsin cleaves (cuts) the protein sequence after each K or R, unless (K or R) is followed by a P).

Example: Cleavage (cut) of the sequence MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK should result in these 4 sequences (peptides):

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK 

Note that there is no cleavage after K in the second peptide (because P comes after K) and there is no cleavage after R in the third peptide (because P comes after R).

I have written this code in Python, but it doesn't work well. Is there any way to implement this regular expression more meaningfully?

    # Open the file and read it line by line.

    myprotein = open(raw_input('Enter input filename: '),'r')
    if  os.path.exists("trypsin_digest.txt"):
        os.remove("trypsin_digest.txt")
    outfile = open("trypsin_digest.txt",'w+')

    for line in myprotein:
        protein = line.rstrip()
        protein = re.sub('(?<=[RK])(?=[^P])','', protein)

    for peptide in protein:
        outfile.write(peptide)
    print 'results written to:\n', os.getcwd() +'\ trypsin_digest.txt'

This is how I got it to work for me

   myprotein = open(raw_input('Enter input filename: '),'r')
   my_protein = []

   for protein in myprotein:
   myprotein = protein.rstrip('\n')
   my_protein.append(myprotein)
   my_pro = (''.join(my_protein))

   #cleaves sequence    
   peptides = re.sub(r'(?<=[RK])(?=[^P])','\n', my_pro)
   print peptides

Protein Sequence:

MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

Output(trypsin cleaved sites) or peptides

MVPPPPSR

GGAAKPGQLGR

SLGPLLLLLRPEEPEDGDR

EICSESK

MVPPPPSR

GGAAKPGQLGR

SLGPLLLLLRPEEPEDGDR

EICSESK

MVPPPPSR

GGAAKPGQLGR

SLGPLLLL开发者_开发知识库LRPEEPEDGDR

EICSESK


regexes are nice, but here's a solution that uses regular python. Since you're looking for subsequences in the bases, it makes sense to build this as a generator, which yields the fragments.

example = 'MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK'

def trypsin(bases):
    sub = ''
    while bases:
        k, r = bases.find('K'), bases.find('R')
        cut = min(k, r)+1 if k > 0 and r > 0 else max(k, r)+1
        sub += bases[:cut]
        bases = bases[cut:]
        if not bases or bases[0] != 'P':
            yield sub
            sub = ''


print list(trypsin(example))


EDIT With a slight modification your regex works well:

In your comment you mentioned you have multiple sequences in a file (let's call it sequences.dat):

$ cat sequences.dat
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> with open('sequences.dat') as f:
    s = f.read()

>>> print(s)
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK
MVPPPPSRGGAAKPGQLGRSLGPLLLLLRPEEPEDGDREICSESK

>>> protein = re.sub(r'(?<=[RK])(?=[^P])','\n', s, re.DOTALL)

>>> protein.split()
['MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK', 'MVPPPPSR', 'GGAAKPGQLGR', 'SLGPLLLLLRPEEPEDGDR', 'EICSESK']

>>> print protein
MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK

MVPPPPSR
GGAAKPGQLGR
SLGPLLLLLRPEEPEDGDR
EICSESK


I believe the following regexp will do as you have described:

([KR]?[^P].*?[KR](?!P))

Result below from pythonregexp

>>> regex = re.compile("([KR]?[^P].*?[KR](?!P))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xb1a9f49eb4111980>
>>> regex.match(string)
<_sre.SRE_Match object at 0xb1a9f49eb4102980>

# List the groups found
>>> r.groups()
(u'MVPPPPSR',)

# List the named dictionary objects found
>>> r.groupdict()
{}

# Run findall
>>> regex.findall(string)
[u'MVPPPPSR', u'GGAAKPGQLGR', u'SLGPLLLLLRPEEPEDGDR', u'EICSESK']
0

精彩评论

暂无评论...
验证码 换一张
取 消