I have a DNA file in the following format:
>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC
How do I read this file and extract the DNA sequence part (ACCAGAGCGG...) without any newlines, for example:
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA
Maybe regex isn't needed?
If there is always exactly one header line:
dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)
Here, text is the contents of your file (for example, text = open('yourfile').read()).
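For a slightly more defensive take on the same idea, here is a minimal sketch. The names extract_sequence and read_sequence are made up for illustration, and it assumes the file holds a single record (with several records, everything after the headers would be concatenated into one string). It uses splitlines() so that Windows line endings are also handled, and it closes the file when done.

```python
def extract_sequence(text):
    """Return the sequence from single-record FASTA text, without newlines.

    Assumption: the text contains one record, so every line that does not
    start with '>' belongs to the same sequence.
    """
    lines = text.splitlines()  # handles both '\n' and '\r\n'
    # Skip header lines and glue the remaining lines together.
    return ''.join(line.strip() for line in lines if not line.startswith('>'))


def read_sequence(path):
    """Read a file and return its sequence as one string (hypothetical helper)."""
    with open(path) as handle:  # the file is closed automatically
        return extract_sequence(handle.read())
```

Usage would be something like dna = read_sequence('yourfile').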
I did some tests, and it appears that the following is more efficient than delroth's answer:
text.split('\n', 1)[1].replace('\n', '')
Edit: wait, it's not so simple. I timed both methods, twice each, using Python 2.6.4 and 3.1.1, on a ~30 MB file:
Python 2.6.4, my version:
$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 221 msec per loop
$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 219 msec per loop
Python 2.6.4, delroth's version:
$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 392 msec per loop
$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 390 msec per loop
Python 3.1.1, my version:
$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 803 msec per loop
$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 798 msec per loop
Python 3.1.1, delroth's version:
$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop
$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop
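Conclusion: in these runs Python 3.1.1 is much slower than Python 2.6.4, and which of the two snippets is faster depends on the Python version. The split/replace version wins on 2.6.4 (about 220 vs. 390 msec), while the split/join version wins on 3.1.1 (about 610 vs. 800 msec).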
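If you want to repeat the comparison on your own interpreter, something along these lines should work. It is only a rough sketch: the function names and the synthetic sample are my own, and it times the string handling on an in-memory string, not the file read, so the numbers will not be directly comparable to the ones above.

```python
import timeit


def join_version(text):
    # Split on every newline, drop the header, join the rest.
    return ''.join(text.split('\n')[1:])


def replace_version(text):
    # Split once to cut off the header, then delete the remaining newlines.
    return text.split('\n', 1)[1].replace('\n', '')


# Synthetic input (an assumption, not the ~30 MB file used above).
header = '>gi|5524211|gb|AAD44166.1| cytochrome\n'
line = 'ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC\n'
sample = header + line * 20000

# Both approaches should produce the same sequence.
assert join_version(sample) == replace_version(sample)

for func in (join_version, replace_version):
    # Best of 3 runs, 10 calls per run, reported per call.
    best = min(timeit.repeat(lambda: func(sample), number=10, repeat=3))
    print('%s: %.1f msec per loop' % (func.__name__, best / 10 * 1000))
```

Note that replace_version raises an IndexError if the text contains no newline at all, whereas join_version just returns an empty string in that case.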