python len function question

Developer https://www.devze.com 2023-02-25 17:35 Source: Internet

Related topics: python regex

I am a complete Python noob so please forgive my simple question. I am trying to write a script that will find all sequences in a huge string that match ATxxxCA, ATxxxxCA, ATxxxxxCA, or ATxxxxxxCA, where x can be any character. When the ATxxxCA pattern is matched, I would then like the script to capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

I attempted to start the script like this:

SeqMatch = input("enter DNA sequence to search: ")
for s in re.findall(r'AT(.*?)CA', SeqMatch):
    if len(s) is < 10:
        print(s)
    else:
        print('no sequence matches')

I seem to be doing something wrong in my if loop? Can anyone help? Thanks in advance!

Take care of overlaps:

import re

adn = ('TCGCGCCCCCCCCCCATCAAGACATGGTTTTTTTTTTATTTATCAGATTACAGATACA'
       'GTTATGGGGGGGGGGATATACAGATGCATAGCGATTAGCCTAGCTA')


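# Pattern 1: the trailing (.{10}) consumes the 10 characters after each match,
# so a second match whose leading context lies inside them is skipped.
regx = re.compile(r'(.{10})(AT.{3,6}CA)(.{10})')
res = regx.findall(adn)
for u in res:
    print(u)

print()

# Pattern 2: match only the leading context and the motif, then slice the
# trailing 10 characters out of the string so they are not consumed.
pat = re.compile(r'(.{10})(AT.{3,6}CA)')
li = []
for mat in pat.finditer(adn):
    x = mat.end()
    li.append(mat.groups() + (adn[x:x+10],))
for u in li:
    print(u)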

Result:

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('TTTTTTTTTT', 'ATTTATCA', 'GATTACAGAT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')
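
An alternative that is not in the answer above, offered only as a sketch: wrapping the whole pattern in a lookahead makes every match zero-width, so nothing is consumed and hits with overlapping flanks are all reported. The names OVERLAP_PAT and overlapping_hits are made up for this illustration, and note that it returns at most one hit per starting position.

```python
import re

# The outer (?=...) is a lookahead: it checks the pattern without consuming
# any characters, so the scanner can try every starting position in turn.
OVERLAP_PAT = re.compile(r'(?=(.{10})(AT.{3,6}CA)(.{10}))')

def overlapping_hits(seq):
    """Return (before, motif, after) tuples, including overlapping ones."""
    return [m.groups() for m in OVERLAP_PAT.finditer(seq)]

# Two motifs separated by exactly 10 characters, so their flanks overlap.
demo = 'c' * 10 + 'ATgggCA' + 't' * 10 + 'ATgggCA' + 'c' * 10
for hit in overlapping_hits(demo):
    print(hit)
```

On this demo string the consuming pattern (.{10})(AT.{3,6}CA)(.{10}) finds only the first motif, while the lookahead version finds both.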

I seem to be doing something wrong in my if loop?

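Python cannot parse "is <" here: "is" is the identity operator and "<" is a comparison operator, and the two cannot be written back to back, so "len(s) is < 10" is a syntax error.

Remove "is" from your if check: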

if len(s) < 10:
    print(s)
else:
    print('no sequence matches')
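
One more thing you may run into once the syntax error is gone: the else branch sits inside the for loop, so 'no sequence matches' is printed once for every long match and never when there are no matches at all. A minimal sketch of one way around that, assuming you want the message only when nothing qualifies (short_gaps is a made-up helper name, and the import of re is added because the original snippet does not show it):

```python
import re

def short_gaps(seq, limit=10):
    """Return the text between AT and CA for each match shorter than limit."""
    # The non-greedy .*? stops at the first CA that follows each AT.
    return [gap for gap in re.findall(r'AT(.*?)CA', seq) if len(gap) < limit]

hits = short_gaps('ggATxxxCAgg')
if hits:
    for gap in hits:
        print(gap)
else:
    # Reached only when no match at all passed the length check.
    print('no sequence matches')
```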

You also said:

When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb

If you want to capture the preceding and following 10 characters, change your regex to

 (.{10})AT(.*)CA(.{10})

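You'll get your 10 a's as one group, followed by the text between AT and CA, followed by your 10 b's. Note that .* is greedy and unbounded; to allow only 3 to 6 characters between AT and CA, as in the question, use .{3,6} instead.

Or you can capture all of it by putting one set of parentheses around the whole thing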

 (.{10}AT.*CA.{10})

Regexpal is a godsend for creating and debugging regexes.
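
To see how much the quantifier between AT and CA matters, here is a small sketch comparing the unbounded .* with a bounded .{3,6}. The test string is invented for the illustration; the bounded pattern is the one used elsewhere on this page.

```python
import re

# Invented test string: two motifs with 20 filler characters between them.
seq = 'a' * 10 + 'ATxxxCA' + 'b' * 20 + 'ATyyyyCA' + 'c' * 10

# Greedy .* runs to the LAST CA that still has 10 characters after it,
# so both motifs and the filler are swallowed into a single group.
greedy = re.findall(r'(.{10})AT(.*)CA(.{10})', seq)

# .{3,6} allows only 3 to 6 characters between AT and CA,
# so each motif is reported separately.
bounded = re.findall(r'(.{10})AT(.{3,6})CA(.{10})', seq)

print(len(greedy), len(bounded))
for groups in bounded:
    print(groups)
```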

Here's an example:

import re

s = "a"*20 + "ATxxxxCA" + "b"*20
rec = re.compile(r'(AT.{3,6}CA)')
mo = rec.search(s)
# Slice 10 characters on either side of the match.
print(s[mo.start()-10:mo.end()+10])
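
A caveat with slicing this way: if the match starts fewer than 10 characters into the string, mo.start()-10 goes negative and Python counts from the end of the string, which usually yields an empty slice. A rough sketch that clamps the start at 0 and collects every match rather than just the first (MOTIF and with_flanks are names made up here, not part of the answer above):

```python
import re

MOTIF = re.compile(r'AT.{3,6}CA')

def with_flanks(seq, flank=10):
    """Return each motif together with up to `flank` characters on each side."""
    out = []
    for m in MOTIF.finditer(seq):
        # Clamp at 0 so a match near the start does not produce a negative
        # index; slicing past the end of the string is already safe.
        start = max(0, m.start() - flank)
        out.append(seq[start:m.end() + flank])
    return out

print(with_flanks('a' * 20 + 'ATxxxxCA' + 'b' * 20))
print(with_flanks('aaATxxxCAbb'))
```

Because only the motif itself is matched, the flanks of neighbouring hits are free to overlap.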
