How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fe开发者_Python百科tch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.
Analogously, is there a programmatic python interface for getting phastCons scores?
thanks.
Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO
See my answer to your question over at Biostar:
http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc
Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:
region = rec.seq[start:end]
to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.
Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.
pyfasta is the module you're looking for. From the description
fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
https://github.com/brentp/pyfasta
精彩评论