I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file giving me infos like number of sequences, length, nucleotide/aminoacid content, etc. and maybe automatically draw descriptive plots. Also an R BIoconductor solution or a BioPerl module 开发者_Go百科would do, but I didn't manage to find anything.
Can you help me? Thanks a lot :-)
Some of the emboss tools are a collection of small tools that can help you out.
seqstats
returns sequence lengthpepstats
should give you aminoacid content etc. Some of the tools also offer plotting functions. Very handy. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html
To count number of fasta entries, I use:
grep -c '^>' mySequences.fasta
.
To make sure none of the entries are duplicate, I check that I get the same number when doing this: grep '^>' mySequences.fasta | sort | uniq | wc -l
You may also be interested in faSize, which is a tool from the Kent Source Tree, although this requires a bit more effort (you must dload and compile) than just using grep... here is some example output:
me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
real 0m3.710s
user 0m3.541s
sys 0m0.164s
Screed in python is brilliant:
import screed
for record in screed.open(fastafilename):
print(len(record.sequence))
It should be noted (for anyone stumbling upon this, like I just did) that there is a robust python library specifically designed to handle these tasks called Biopython. In a few lines of code, you can quickly access answers for all of the above questions. Here are some very basic examples, mostly adapted from the link. There are boiler-plate GC% graphs and sequence length graphs in the tutorial also.
In [1]: from Bio import SeqIO
In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse('/home/kevin/stack/ls_orchid.fasta', """fasta""")]
In [3]: allSeqs[0]
Out[3]: SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
In [4]: len(allSeqs) #number of unique sequences in the file
Out[4]: 94
In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object
Out[5]: 740
In [6]: A_count = allSeqs[0].seq.count('A')
C_count = allSeqs[0].seq.count('C')
G_count = allSeqs[0].seq.count('G')
T_count = allSeqs[0].seq.count('T')
print A_count # number of A's
144
In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons
Out[7]: 0
In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid
Out[8]: Seq('RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY', HasStopCodon(ExtendedIUPACProtein(), '*'))
精彩评论