I need to parse a preliminary GenBank flat file. The sequence hasn't been published yet, so I can't look it up by accession and download a FASTA file. I'm new to bioinformatics, so could someone show me where I could find a BioPerl or Biopython script to do this myself? Thanks!
You need the Bio::SeqIO module to read or write out bioinformatics data. The SeqIO HOWTO should tell you everything you need to know, but here's a small read-a-GenBank-file script in Perl to get you started!
Here is a Biopython solution. I'll first assume your GenBank file describes a genome sequence, then give a second solution assuming it describes a gene sequence instead. It would have helped to know which of these you are dealing with.
Genome Sequence Parsing:
Read your GenBank flat file with:
from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")
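One caveat: SeqIO.read expects exactly one record in the file and raises a ValueError otherwise. A preliminary flat file sometimes holds several records (one per contig, say), so a minimal sketch that copes with both cases might look like this. The helper name load_genbank_records is made up for illustration and is not part of Biopython.

```python
from Bio import SeqIO


def load_genbank_records(source):
    """Return every record in a GenBank flat file as a list.

    `source` may be a file path or an open text handle.
    (Hypothetical helper, not part of Biopython.)
    """
    # SeqIO.parse yields records one at a time, so it works whether
    # the file holds a single record or many.
    return list(SeqIO.parse(source, "genbank"))


# Example usage (placeholder path from the answer above):
# records = load_genbank_records("yourGenbankFileDirectory/yourGenbankFile.gb")
# for rec in records:
#     print(rec.id, len(rec.seq))
```

If the list has exactly one entry, records[0] is the same object SeqIO.read would have returned.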
If you just want the raw sequence as a plain string (the older record.seq.tostring() call is no longer available in current Biopython releases):
rawSequence = str(record.seq)
You will probably also need a name for this sequence, to use as the ">header" line of the .fasta file. Let's see what names came with the GenBank .gb file:
nameSequence = record.features[0].qualifiers
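This should return a dictionary of qualifiers for the first feature, which in most GenBank files is the "source" feature covering the whole sequence. It holds whatever names and synonyms the author of the file annotated. Note that each value in the dictionary is a list, even when there is only one entry.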
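To show what the ">header" line and the sequence lines of a FASTA entry look like once you have a name and a raw sequence, here is a plain-Python sketch. The function name format_fasta and the 60-character line width are my own choices for illustration; Biopython can also write FASTA for you with SeqIO.write(record, handle, "fasta").

```python
def format_fasta(header, sequence, width=60):
    """Return a single FASTA entry as text.

    `header` is the name that follows ">", `sequence` is the raw
    sequence string, and `width` is the number of residues per line
    (60 is a common convention, assumed here).
    """
    # Split the sequence into fixed-width chunks, one per line.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Example usage with the variables from the answer above:
# with open("yourSequence.fasta", "w") as handle:
#     handle.write(format_fasta("myPreliminarySequence", rawSequence))
```

The output file name and header in the usage comment are placeholders; pick whichever qualifier value from the dictionary suits you as the header.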
Gene Sequence Parsing:
Read your GenBank flat file with:
from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")
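To get a list of raw sequences, one per annotated feature (record.features holds every feature in the file, so this also covers non-gene entries such as the source feature):
rawSequenceList = [str(gene.extract(record.seq)) for gene in record.features]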
To get a list of names for each gene sequence (more precisely, a dictionary of qualifiers, including any synonyms, for each gene):
nameSequenceList = [gene.qualifiers for gene in record.features]
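Because record.features contains every annotated feature (source, gene, CDS and so on), the two lists above get one entry per feature, not per gene. If you only want genes, one option is to filter on the feature type and pull a readable label out of the qualifiers. The sketch below assumes the file uses the common "gene" or "locus_tag" qualifiers; the function name gene_sequences and the fallback label are hypothetical.

```python
def gene_sequences(record, feature_type="gene"):
    """Return a list of (label, sequence) pairs for matching features.

    `record` is a Biopython SeqRecord, e.g. from SeqIO.read.
    (Hypothetical helper, not part of Biopython.)
    """
    pairs = []
    for index, feature in enumerate(record.features):
        # Skip the source feature, CDS entries, etc.
        if feature.type != feature_type:
            continue
        qualifiers = feature.qualifiers
        # Qualifier values are lists; take the first name available,
        # falling back to a made-up label if none is annotated.
        names = (qualifiers.get("gene")
                 or qualifiers.get("locus_tag")
                 or ["{}_{}".format(feature_type, index)])
        # extract() follows the feature location, including strand,
        # so minus-strand genes come back reverse-complemented.
        pairs.append((names[0], str(feature.extract(record.seq))))
    return pairs
```

Calling gene_sequences(record) on the record read above gives you name and sequence together, which keeps the two from drifting out of step. If your file annotates only CDS features, passing feature_type="CDS" should work the same way.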