python len function question

Developer https://www.devze.com 2023-02-25 17:35 Source: web
I am a complete Python noob so please forgive my simple question. I am trying to write a script that will find all sequences in a huge string that match ATxxxCA, ATxxxxCA, ATxxxxxCA, or ATxxxxxxCA where x can be any character. When the ATxxxCA pattern is matched, I would then like the script to capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

I attempted to start the script like this:

SeqMatch = input("enter DNA sequence to search: ")
for s in re.findall(r'AT(.*?)CA', SeqMatch):
    if len(s) is < 10:
        print(s)
    else:
        print('no sequence matches')

I seem to be doing something wrong in my if loop? Can anyone help? Thanks in advance!


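Take care of overlaps: a pattern that also consumes the 10 trailing characters will skip any motif that begins inside them.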

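import re

adn = ('TCGCGCCCCCCCCCCATCAAGACATGGTTTTTTTTTTATTTATCAGATTACAGATACA'
       'GTTATGGGGGGGGGGATATACAGATGCATAGCGATTAGCCTAGCTA')

# First attempt: the trailing (.{10}) consumes characters, so a motif
# that begins inside those 10 characters is skipped.
regx = re.compile(r'(.{10})(AT.{3,6}CA)(.{10})')
res = regx.findall(adn)
for u in res:
    print(u)

print()

# Second attempt: match only the prefix and the motif, then slice the
# following 10 characters out of the string without consuming them.
pat = re.compile(r'(.{10})(AT.{3,6}CA)')
li = []
for mat in pat.finditer(adn):
    x = mat.end()
    li.append(mat.groups() + (adn[x:x+10],))
for u in li:
    print(u)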

result

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('TTTTTTTTTT', 'ATTTATCA', 'GATTACAGAT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')
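
As an alternative to slicing by hand (this is not the approach used in the answer above), the whole pattern can be wrapped in a zero-width lookahead so that nothing is consumed and every start position is tried. A minimal sketch, assuming a made-up sample string with lowercase flanks; the variable names are illustrative only:

```python
import re

# Hypothetical sample: two motifs whose 10-character contexts share the b's.
seq = 'a' * 10 + 'ATxxxCA' + 'b' * 10 + 'ATyyyyCA' + 'c' * 10

# Consuming pattern: the trailing (.{10}) eats the b's, so the second
# motif no longer has 10 unconsumed characters in front of it.
consuming = re.findall(r'(.{10})(AT.{3,6}CA)(.{10})', seq)

# Lookahead pattern: (?=...) matches at a position without consuming
# any text, so the scan continues from the very next character.
overlapping = re.findall(r'(?=(.{10})(AT.{3,6}CA)(.{10}))', seq)

print(len(consuming))
for hit in overlapping:
    print(hit)
```

One difference to keep in mind: because the lookahead still requires the trailing (.{10}), a motif with fewer than 10 characters after it is not reported, whereas the slicing version returns it with a shorter tail.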


I seem to be doing something wrong in my if loop?

Python can't parse "is <" here: "is" is itself a comparison operator (an identity test), so it can't be followed directly by "<".

Remove "is" from your if check:

if len(s) < 10:
    print(s)
else:
    print('no sequence matches')
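
For a version that runs end to end, here is a minimal sketch. It assumes the goal is to list every stretch between AT and CA that is shorter than 10 characters; the function name short_gaps and the sample string are made up for illustration. It also adds the import re that the snippet in the question leaves out, and prints the "no sequence matches" message once after the loop instead of once per overly long match.

```python
import re

def short_gaps(seq, limit=10):
    """Return the text between each AT ... CA pair that is shorter than limit."""
    # Non-greedy (.*?) stops at the first CA that follows each AT.
    return [s for s in re.findall(r'AT(.*?)CA', seq) if len(s) < limit]

# Hypothetical input: one short gap ('xxx') and one gap of 12 characters.
hits = short_gaps('ggATxxxCAgg' + 'AT' + 'y' * 12 + 'CAgg')
for s in hits:
    print(s)
if not hits:
    print('no sequence matches')
```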

You also said:

When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

If you want to capture the preceding and following 10 characters, change your regex to

 (.{10})AT(.*)CA(.{10})

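You'll get your 10 a's in one group, followed by the text between AT and CA, followed by your 10 b's. Note that .* is greedy and will stretch to the last CA it can; use .{3,6} in its place to allow only 3 to 6 characters, as in your question.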

Or you can capture all of it by using one set of parentheses around the whole thing

 (.{10}AT.*CA.{10})
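
To see what the three groups look like in practice, here is a short sketch on a made-up string containing two motifs. It compares the pattern as written with .* against a variant using .{3,6}; the variable names are illustrative only.

```python
import re

# Hypothetical sample with two motifs, ATxxxCA and ATyyyCA.
s = 'a' * 10 + 'ATxxxCA' + 'b' * 10 + 'ATyyyCA' + 'c' * 10

# Greedy .* runs as far as it can, so the middle group swallows
# everything from the first AT up to the last usable CA.
greedy = re.search(r'(.{10})AT(.*)CA(.{10})', s)

# Bounded .{3,6} keeps the gap between AT and CA to 3-6 characters.
bounded = re.search(r'(.{10})AT(.{3,6})CA(.{10})', s)

print(greedy.groups())
print(bounded.groups())
```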

Regexpal is a godsend for creating and debugging regexes.


Here's an example:

import re

s = "a"*20 + "ATxxxxCA" + "b"*20
rec = re.compile(r'(AT.{3,6}CA)')
mo = rec.search(s)
# Slice 10 characters on either side of the match
print(s[mo.start()-10:mo.end()+10])
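
One caveat with this slice: if the match starts fewer than 10 characters into the string, mo.start()-10 is negative, Python counts that index from the end of the string, and the slice comes out wrong. A minimal sketch that clamps the start index and loops over every match; the function name flanked and the sample inputs are hypothetical:

```python
import re

MOTIF = re.compile(r'AT.{3,6}CA')

def flanked(seq, flank=10):
    """Return each motif with up to `flank` characters on either side."""
    out = []
    for mo in MOTIF.finditer(seq):
        # Clamp at 0 so the start index never goes negative.
        start = max(0, mo.start() - flank)
        # An end index past the end of the string is safe in a slice.
        out.append(seq[start:mo.end() + flank])
    return out

print(flanked('a' * 20 + 'ATxxxxCA' + 'b' * 20))
print(flanked('aaATxxxxCAbbb'))
```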