I am a complete Python noob so please forgive my simple question. I am trying to write a script that will find all sequences in a huge string that match ATxxxCA, ATxxxxCA, ATxxxxxCA, or ATxxxxxxCA where x can be any character. When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb
I attempted to start the script like this:
SeqMatch = input("enter DNA sequence to search: ")
for s in re.findall(r'AT(.*?)CA', SeqMatch):
    if len(s) is < 10:
        print(s)
    else:
        print('no sequence matches')
I seem to be doing something wrong in my if loop? Can anyone help? Thanks in advance!
Take care of overlaps:
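import re

adn = ('TCGCGCCCCCCCCCCATCAAGACATGGTTTTTTTTTTATTTATCAGATTACAGATACA'
       'GTTATGGGGGGGGGGATATACAGATGCATAGCGATTAGCCTAGCTA')

# One regex for both flanks: the trailing (.{10}) is consumed, so a match
# whose left flank overlaps those 10 characters is missed
regx = re.compile('(.{10})(AT.{3,6}CA)(.{10})')
res = regx.findall(adn)
for u in res:
    print(u)

print()

# Match only the left flank and the motif, then slice the right flank
# out of the string so that it is not consumed
pat = re.compile('(.{10})(AT.{3,6}CA)')
li = []
for mat in pat.finditer(adn):
    x = mat.end()
    li.append(mat.groups() + (adn[x:x+10],))
for u in li:
    print(u)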
Result:
('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')

('CCCCCCCCCC', 'ATCAAGACA', 'TGGTTTTTTT')
('TTTTTTTTTT', 'ATTTATCA', 'GATTACAGAT')
('GGGGGGGGGG', 'ATATACA', 'GATGCATAGC')
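The answer above still consumes the left flank and the motif, so two motifs that overlap each other would not both be reported. One possible variation, sketched here as an assumption rather than as part of the original answer, wraps the motif in a lookahead so that nothing is consumed, and slices both flanks out of the string. The function name flanked_matches and the flank parameter are made up for illustration. Unlike the regex above, it also returns matches that have fewer than 10 characters before them.

```python
import re

def flanked_matches(seq, flank=10):
    """Yield (before, match, after) for every AT.{3,6}CA, overlapping ones included."""
    # The lookahead consumes no characters, so the scan resumes one position
    # after each match start instead of after the match end.
    pat = re.compile(r'(?=(AT.{3,6}CA))')
    for m in pat.finditer(seq):
        start, end = m.start(1), m.end(1)
        before = seq[max(start - flank, 0):start]  # max() avoids a negative index
        after = seq[end:end + flank]               # slicing past the end is safe
        yield before, m.group(1), after

for hit in flanked_matches("a" * 10 + "ATxxxCA" + "b" * 10):
    print(hit)
```

Because .{3,6} is greedy, this reports the longest motif that starts at each AT. If every length from 3 to 6 should be reported separately, the pattern would need to be run once per length.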
I seem to be doing something wrong in my if loop?
In Python, "is" is an identity operator and cannot be combined with "<", so "len(s) is < 10" is a syntax error.
Remove "is" from your if check:
if len(s) < 10:
    print(s)
else:
    print('no sequence matches')
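Put together, the corrected check could look something like the sketch below. The helper name short_matches, its max_len parameter and the sample string are invented for illustration. It also assumes that "no sequence matches" should be printed once when nothing qualifies, rather than once per rejected match.

```python
import re

def short_matches(seq, max_len=10):
    """Return the text between each AT...CA pair that is shorter than max_len."""
    # The non-greedy (.*?) stops at the first CA after each AT.
    return [s for s in re.findall(r'AT(.*?)CA', seq) if len(s) < max_len]

hits = short_matches("GGATxxxCAGGATyyyyyyyyyyyyCAGG")
if hits:
    for s in hits:
        print(s)
else:
    print('no sequence matches')
```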
You also said:
When the ATxxxCA pattern is matched, I would then like the script to then capture the previous 10 and next 10 characters surrounding the matched ATxxxCA. For example, the result might look like this: aaaaaaaaaaATxxxCAbbbbbbbbbb
If you want to capture the 10 characters before and after the match, change your regex to
(.{10})AT(.{3,6})CA(.{10})
You'll get your 10 a's in one group, followed by the 3 to 6 characters between AT and CA, followed by your 10 b's.
Or you can capture all of it by using one set of parentheses around the whole thing
(.{10}AT.{3,6}CA.{10})
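The quantifier between AT and CA matters here. An unbounded, greedy .* runs from the first AT to the last usable CA, while .{3,6} keeps each match to the lengths the question asks for. A small comparison follows, using a made-up test string with two motifs placed far enough apart that their flanks do not overlap.

```python
import re

# Two motifs separated by 20 characters (sample data invented for this sketch)
seq = "g" * 10 + "ATxxxCA" + "t" * 20 + "ATyyyyCA" + "c" * 10

# Greedy .* swallows everything from the first AT up to the last CA
greedy = re.findall(r'(.{10})AT(.*)CA(.{10})', seq)

# .{3,6} allows only 3 to 6 characters between AT and CA
bounded = re.findall(r'(.{10})AT(.{3,6})CA(.{10})', seq)

print(len(greedy), [mid for _, mid, _ in greedy])
print(len(bounded), [mid for _, mid, _ in bounded])
```

With this input the greedy pattern returns a single match whose middle group spans both motifs, whereas the bounded pattern returns the two motifs separately.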
Regexpal is a godsend for creating/debugging regexes.
Here's an example:
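import re

s = "a"*20 + "ATxxxxCA" + "b"*20
rec = re.compile(r'(AT.{3,6}CA)')
mo = rec.search(s)
# Slice 10 characters either side of the match; max() guards a match near the start
print(s[max(mo.start()-10, 0):mo.end()+10])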