multiFASTA file processing_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2022-12-12 14:26 出处：网络

I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file giving me infos like number of sequences, length, nucleotide/aminoacid content, etc. and maybe automatically draw descriptive plots. Also an R BIoconductor solution or a BioPerl module 开发者_Go百科would do, but I didn't manage to find anything.

Can you help me? Thanks a lot :-)

Some of the emboss tools are a collection of small tools that can help you out.

seqstats returns sequence length
pepstats should give you aminoacid content etc. Some of the tools also offer plotting functions. Very handy. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html

To count number of fasta entries, I use: grep -c '^>' mySequences.fasta.

To make sure none of the entries are duplicate, I check that I get the same number when doing this: grep '^>' mySequences.fasta | sort | uniq | wc -l

You may also be interested in faSize, which is a tool from the Kent Source Tree, although this requires a bit more effort (you must dload and compile) than just using grep... here is some example output:

me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

real    0m3.710s
user    0m3.541s
sys     0m0.164s

Screed in python is brilliant:

import screed

for record in screed.open(fastafilename):
    print(len(record.sequence))

It should be noted (for anyone stumbling upon this, like I just did) that there is a robust python library specifically designed to handle these tasks called Biopython. In a few lines of code, you can quickly access answers for all of the above questions. Here are some very basic examples, mostly adapted from the link. There are boiler-plate GC% graphs and sequence length graphs in the tutorial also.

In [1]: from Bio import SeqIO

In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse('/home/kevin/stack/ls_orchid.fasta', """fasta""")]

In [3]: allSeqs[0]
Out[3]: SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])

In [4]: len(allSeqs) #number of unique sequences in the file
Out[4]: 94

In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object
Out[5]: 740

In [6]: A_count = allSeqs[0].seq.count('A')
        C_count = allSeqs[0].seq.count('C')
        G_count = allSeqs[0].seq.count('G')
        T_count = allSeqs[0].seq.count('T')

        print A_count # number of A's

        144

In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons
Out[7]: 0

In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid
Out[8]: Seq('RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY', HasStopCodon(ExtendedIUPACProtein(), '*'))