
Script for parsing a biological sequence from a public database in Python

Developer https://www.devze.com 2023-02-23 04:17 Source: the web


Greetings to the Stack Overflow community,

I am currently following a bioinformatics module as part of a biomedical degree (I am basically a Python newbie), and the following task is required as part of a Python programming assignment:

Extract motif sequences. These are amino acid sequences (so basically strings, in programming terms) produced by algorithms that run a multiple sequence alignment followed by iterative database scanning to find the best-conserved sequences. The ultimate idea is to infer functional significance from such "motifs".

These motifs are stored on a public database in files that have multiple data fields for each protein (UniProt ID, accession number, the alignment itself stored in a hyperlinked .seq file). Only one of these fields is of interest here: the one called "extracted motif sets".

My question is how to go about writing a script that will essentially parse the "motif strings" and output them to a file. I have now coded the script so that it looks as follows (I don't write the results to files yet):

import os, re, sys, string

printsdb = open('/users/spyros/folder1/python/PRINTSmotifs/prints41_1.kdat', 'r')

protname = None
final_motifs = []

for line in printsdb.readlines():
    if line.startswith('gc;'):
        protname = line.lstrip()
        #string.lower(name)  # convert to lowercase
        break

def extract_final_motifs(protname):
    """Extracts the sequences of the 'final motifs sets' for a PRINTS entry.
    Sequences are on lines starting 'fd;' A simple regex is used for retrieval"""
    for line in printsdb.readlines():
        if line.startswith('fd;'):
            final_motifs = re.compile('^\s+([A-Z]+)\s+<')
            final_motifs = final_motifs.match(line)
            #print(final_motifs.groups()[0])
            motif_dict = {protname : final_motifs}
            break
    return

motif_dict = extract_final_motifs('ADENOSINER')
print(motif_dict)

The problem now is that, while my code loops over a raw database file (prints41_1.kdat) instead of connecting to the public database using the urllib module, as suggested by Simon Cockell below, the output of the script in the Python shell is simply "None", whereas it should be creating a list such as [AAYIGIEVLI, AAYIGIEVLI, AAYIGIEVLI, etc.].

Does anybody have any idea where the logic error is? Any input appreciated! I apologize for the extensive text; I just hope to be as clear as possible. Thanks in advance for any help!
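As a starting point for tracking this down, here is a minimal sketch of three standard Python behaviours that look relevant to the script above. It does not use the real database file: the `io.StringIO` object and the sample strings are stand-ins invented for illustration, and the helper name `returns_nothing` is hypothetical.

```python
import io
import re

def returns_nothing():
    result = {'key': 'value'}   # built inside the function, but never handed back
    return                      # a bare "return" gives the caller None

value = returns_nothing()
print(value)                    # None

# io.StringIO stands in for the opened file (made-up content, not real data).
handle = io.StringIO("gc; ADENOSINER\nfd; AAYIGIEVLI\n")
first_pass = handle.readlines()    # reads every remaining line in one go
second_pass = handle.readlines()   # the handle is now at the end: nothing left
print(len(first_pass), len(second_pass))

# A pattern that begins with ^\s+ needs leading whitespace, so it cannot
# match a line whose first characters are "fd;".
pattern = re.compile(r'^\s+([A-Z]+)\s+<')
attempt = pattern.match("fd; AAYIGIEVLI <")
print(attempt)                     # None
```

If the script behaves the same way, each of these would need attention: the function has to return the value it builds, the file has to be read once (or reopened), and the pattern has to allow for the `fd;` prefix.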

First off, what you are doing is almost right, but you have to change "extracted motif sets" on line 2 to a variable, say line. The for loop returns the data from the file line by line, each time assigning it to the variable named after for, in this case line. Now comes the question of how the lysozyme.seq file is formatted. It sounds like none of the data fields contain any spacing. If so, you might get away with line.split(" ") or line.split("\t") (\t means tab). split does what it says: it splits the string every time it sees a " " or a "\t", depending on which one you write in the program.
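The loop-and-split idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample text, its field layout and the entry names other than ADENOSINER are invented, so the real prints41_1.kdat may well need different field positions. The sketch also calls `split()` with no argument, which splits on any run of spaces or tabs, instead of choosing between `" "` and `"\t"`.

```python
import io

# Hypothetical sample data: the layout is an assumption for illustration,
# not a verified excerpt of prints41_1.kdat.
SAMPLE = """\
gc; ADENOSINER
fd; AAYIGIEVLI  ENTRY_ONE  10  10
fd; AAYIGIEVLI  ENTRY_TWO  10  10
gc; OTHERENTRY
fd; WWWWW  ENTRY_THREE  5  5
"""

def extract_final_motifs(handle, protname):
    """Return {protname: [motif, ...]} from 'fd;' lines under the matching 'gc;' line."""
    motifs = []
    in_entry = False
    for line in handle:            # a file object yields one line per iteration
        fields = line.split()      # split on any run of spaces or tabs
        if not fields:
            continue               # skip blank lines
        if fields[0] == 'gc;':
            # Track whether we are inside the entry we were asked for.
            in_entry = len(fields) > 1 and fields[1] == protname
        elif fields[0] == 'fd;' and in_entry and len(fields) > 1:
            motifs.append(fields[1])   # assumed: the motif is the second field
    return {protname: motifs}          # hand the result back to the caller

motif_dict = extract_final_motifs(io.StringIO(SAMPLE), 'ADENOSINER')
print(motif_dict)   # {'ADENOSINER': ['AAYIGIEVLI', 'AAYIGIEVLI']}
```

With a real file, the `io.StringIO(SAMPLE)` argument would be replaced by the handle returned from `open(...)`, ideally inside a `with` block so the file is closed afterwards.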