I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answser this question using the following Erlang code:
-module(golf).
-export([test/0]).
line([],{Sequences,Total}) -> {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.
scanLines(S,Sequences,Total)->
case io:get_line(S,'') of
eof -> {Sequences,Total};
{error,_} ->{Sequences,Total};
Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
end .
test()->
{Sequences,Total}=scanLines(standard_io,0,0),
io:format("~p\n",[Total/(1.0*Sequences)]),
halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
this code seems to work fine for a small fasta file but it takes hours to parse a larger 开发者_如何学Pythonone (>100Mo). Why ? I'm an Erlang newbie, can you please improve this code ?
If you need really fast IO then you have to do little bit more trickery than usual.
-module(g).
-export([s/0]).
s()->
P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
r(P, 0, 0),
halt().
r(P, C, L) ->
receive
{P, {data, {eol, <<$>:8, _/binary>>}}} ->
r(P, C+1, L);
{P, {data, {eol, Line}}} ->
r(P, C, L + size(Line));
{'EXIT', P, normal} ->
io:format("~p~n",[L/C])
end.
It is fastest IO as I know but note -noshell -noinput
.
Compile just like erlc +native +"{hipe, [o3]}" g.erl
but with -smp disable
erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
and run:
time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464
real 0m3.241s
user 0m3.060s
sys 0m0.124s
With -smp enable
but native it takes:
$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.103s
user 0m4.944s
sys 0m0.112s
Byte code but with -smp disable
(almost in par with native because most of work is done in port!):
$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m3.565s
user 0m3.436s
sys 0m0.104s
Just for completeness byte code with smp:
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464
real 0m5.433s
user 0m5.236s
sys 0m0.128s
For comparison sarnold version gives me wrong answer and takes more on same HW:
$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776
real 0m17.569s
user 0m16.749s
sys 0m0.664s
EDIT: I have looked at characteristics of uniprot_sprot.fasta
and I'm little bit surprised. It is 3824397 rows and 232MB. It means that -smp disabled
version can handle 1.18 million text lines per second (71MB/s in line oriented IO).
I too am learning Erlang, thanks for the fun question.
I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.
Also, if you don't mind working with a file directly rather than standard_io
, perhaps you could speed things along by using file:open(..., [raw, read_ahead])
. raw
means the file must be on the local node's filesystem, and read_ahead
specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)
I'd expect the read_ahead
to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".
EDIT
Using file:open("uniprot_sprot.fasta", [read, read_ahead])
gets 1m31s
on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)
Using file:open(.., [read, read_ahead])
and file:read_line(S)
, I get 0m34s
.
Using file:open(.., [read, read_ahead, raw])
and file:read_line(S)
, I get 0m9s
. Yes, nine seconds.
Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:
-module(golf).
-export([test/0]).
line([],{Sequences,Total}) -> {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.
scanLines(S,Sequences,Total)->
case file:read_line(S) of
eof -> {Sequences,Total};
{error,_} ->{Sequences,Total};
{ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
end .
test()->
F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
case F of
{ ok, File } ->
{Sequences,Total}=scanLines(File,0,0),
io:format("~p\n",[Total/(1.0*Sequences)]);
{ error, Reason } ->
io:format("~s", Reason)
end,
halt().
It looks like your big performance problems have been solved by opening the file in raw mode, but here's some more thoughts if you need to optimise that code further.
Learn and use fprof.
You're using string:strip/1
primarily to remove the trailing newline. As erlang values are immutable you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count, otherwise I'd try writing a length function the counts the number of relevant characters and ignores irrelevant ones.
I'm wary of advice that says binaries are better than lists, but given how little processing you it's probably the case here. The first steps are to open the file in binary mode and using erlang:size/1
to find the length.
It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences)
is only necessary in languages with broken division. Erlang division works correctly.
The call string:len(string:strip(L))
traverses the list at least twice (I'm unaware of the string:strip implementation). Instead you could write a simple function to count the line length w/0 the spaces:
stripped_len(L) ->
stripped_len(L, 0).
stripped_len([$ |L], Len) ->
stripped_len(L, Len);
stripped_len([_C|L], Len) ->
stripped_len(L, Len + 1);
stripped_len([], Len) ->
Len.
The same method can be applied to binaries as well.
Did you try Elixir (elixir-lang.org) which is runs on top of Erlang and has a syntax similar to Ruby. Elixir solves String problems in the following way:
Elixir strings are UTF8 binaries, with all the raw speed and memory savings that brings. Elixir has a String module with Unicode functionality built-in and is a great example of writing code that writes code. String.Unicode reads various Unicode database dumps such as UnicodeData.txt to dynamically generate Unicode functions for the String module built straight from that data! (http://devintorr.es/blog/2013/01/22/the-excitement-of-elixir/)
Just wonder whether Elixir would be faster?
精彩评论