Below are three versions of the brute-force Pythagorean-triplet problem with the additional constraint that a+b+c=1000. All of them were compiled with -O3 under GHC 7.0.3. Sample run-times are listed below.
Questions:
- Why does the second version run faster than the first?
- Why does the third version run faster than the second?
I realize that the difference is small, but the ordering is consistent on average.
main=print . product . head $ [[a,b,c] | a<-[1..1000],b<-[a..1000], let c=1000-a-b, a^2+b^2==c^2]
real 0m0.046s
user 0m0.039s
sys 0m0.005s
main=print . product . head $ [[a,b,c] | a<-[1..1000],b<-[1..1000], let c=1000-a-b, a^2+b^2==c^2]
real 0m0.045s
user 0m0.036s
sys 0m0.006s
main=print . product . head $ [[a,b,c] | a<-[1..1000],b<-[1..1000], b>=a, let c=1000-a-b, a^2+b^2==c^2]
real 0m0.040s
user 0m0.033s
sys 0m0.005s
Let's name the three programs A, B and C respectively.
C vs B
We'll begin with the simplest: C has an extra constraint (b >= a) relative to B. Intuitively, this means that the search space traversed in C is smaller than that of B. We can rationalize this by noting that instead of ranging over every possible ordered pair a, b (of which we know there are 1000^2 = 1000000 possible pairs), we do not consider all those cases where b is smaller than a. Presumably, the check for whether or not b >= a produces a little bit of extra code (the comparison) that is outweighed by the computation avoided by running the comparison, hence we note a (slight) speed-up. Fair enough.
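As a quick sanity check on that claim, here is a minimal sketch (the names testsB and testsC are invented for illustration, not taken from the programs above) counting how many pairs survive to the expensive triplet test in each version:
-- Minimal sketch: count how many (a, b) pairs reach the costly
-- a^2 + b^2 == c^2 test in B versus C.
testsB, testsC :: Int
testsB = length [ () | a <- [1..1000 :: Int], b <- [1..1000] ]          -- 1000000 tests
testsC = length [ () | a <- [1..1000 :: Int], b <- [1..1000], b >= a ]  -- 500500 tests

main :: IO ()
main = print (testsB, testsC)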
B vs A
The next is a bit trickier: it appears that A has the same constraint as C (b >= a), but encoded differently (i.e. here we have it encoded as the range of values b may attain in the List Monad). We might think then that A should run faster than B (as indeed, it should run similarly to C). Clearly then, our intuition is lacking.
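To make "encoded as the range" concrete, here is roughly how A's comprehension desugars (a hand-written sketch, not GHC's actual output): the b >= a constraint lives in the inner generator's lower bound rather than in a guard.
-- Rough desugaring of A: the inner list starts at a, so b < a pairs
-- are never generated in the first place.
versionA :: [[Integer]]
versionA =
  concatMap (\a ->
    concatMap (\b ->
      let c = 1000 - a - b
      in if a^2 + b^2 == c^2 then [[a, b, c]] else [])
      [a .. 1000])
    [1 .. 1000]

main :: IO ()
main = print . product . head $ versionA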
To the Core
Now, since we can't always trust our intuition, we should investigate what's really going on in the generated GHC Core. Let's dump the core for our 3 programs (sans optimizations):
for p in A B C
do
ghc -ddump-simpl $p.hs >> $p.core
done
If we compare B.core and C.core, we'll note that both files have roughly the same structure:
Begin by calling a few familiar functions: (System.IO.print ...), (Data.List.product ...) and (GHC.List.head ...)
Next, we define a pair of nested recursive functions with signature:
ds_dxd [Occ=LoopBreaker]
:: [GHC.Integer.Type.Integer] -> [[GHC.Integer.Type.Integer]]
We call each of these defined functions on an enumeration of the form:
(GHC.Enum.enumFromTo
@ GHC.Integer.Type.Integer
GHC.Num.$fEnumInteger
(GHC.Integer.smallInteger 1)
(GHC.Integer.smallInteger 1000)))
and perform our logic within the innermost defined function. Of note, in B.core we see
case GHC.Classes.==
@ GHC.Integer.Type.Integer
...
(GHC.Num.+
...
(GHC.Real.^
...
ds3_dxc
(GHC.Integer.smallInteger 2))
(GHC.Real.^
...
ds8_dxg
(GHC.Integer.smallInteger 2)))
(GHC.Real.^
...
c_abw
(GHC.Integer.smallInteger 2))
corresponding to a naive filter of all the possible values conforming to our constraint, whereas in C.core we instead have:
case GHC.Classes.>=
@ GHC.Integer.Type.Integer GHC.Classes.$fOrdInteger ds8_dxj ds3_dxf
of _ {
GHC.Bool.False -> ds5_dxh ds9_dxk;
GHC.Bool.True ->
let {
...
case GHC.Classes.==
...
corresponding to the addition of an extra >= constraint before our triplet constraint, and hence searching through fewer integers for a shorter runtime, just as our intuition expects.
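Read back at the source level, B's pair of nested recursive functions behaves roughly like the hand-written loops below (a sketch with invented names triplesB, goA and goB; this is not GHC's output). C differs only in adding the b >= a guard inside the inner loop's body.
-- Hand-written analogue of B's nested loops: the inner loop always
-- restarts at 1, and every pair reaches the expensive triplet test.
triplesB :: [[Integer]]
triplesB = goA [1 .. 1000]
  where
    goA []       = []
    goA (a : as) = goB a [1 .. 1000] ++ goA as

    goB _ []       = []
    goB a (b : bs)
      | a^2 + b^2 == c^2 = [a, b, c] : goB a bs
      | otherwise        = goB a bs
      where
        c = 1000 - a - b

main :: IO ()
main = print . product . head $ triplesB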
In comparing A.core and B.core, we immediately see a familiar structure (a nested pair of recursive functions, each called over an enumeration) and in fact, it appears that the core output for A and B is nearly identical! The difference, it seems, lies in the innermost enumeration:
ds5_dxd
(GHC.Enum.enumFromTo
@ GHC.Integer.Type.Integer
GHC.Num.$fEnumInteger
ds3_dxb
(GHC.Integer.smallInteger 1000))
where we see the enumeration range from a given induction variable ds3_dxb to 1000, as opposed to remaining a static range ([1..1000]).
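For reference, that Core enumeration is just the desugaring of a source-level range whose lower bound depends on the outer loop variable:
-- [a..1000] desugars to enumFromTo a 1000, which is what appears in
-- A's Core with the lower bound tied to the induction variable ds3_dxb.
inner :: Integer -> [Integer]
inner a = enumFromTo a 1000   -- identical to [a .. 1000]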
What gives, then? Should this not indicate that A runs faster than B? (We naively expect A to perform similarly to C, given that they implement the same constraint.) Well, it turns out that the compiler optimizations in play produce exceedingly complex behaviour, and their combinations often yield non-intuitive (and frankly bizarre) results; in this case we have two compilers interplaying with one another: ghc and gcc. In order to have a chance at understanding these results, we have to rely on the generated optimized core. (Ultimately it's the generated assembler that truly counts, but we'll ignore that for now.)
To the Optimized Core, and Beyond
Let's generate the optimized core:
for p in A B C
do
ghc -O3 -ddump-simpl $p.hs > $p.core
done
and compare our problem child (A) to its faster counterparts. Comparatively, B and C both benefit from a class of optimizations that A alone does not: let-floating and lambda-lifting. We can see this by noting that the recursive functions in B and C have some 40 fewer lines of code, resulting in tighter inner loops. To understand why A is not benefiting from this optimization, we should take a look at the code that is not being floated out:
let {
c1_s10T
:: GHC.Integer.Type.Integer
-> [[GHC.Integer.Type.Integer]]
-> [[GHC.Integer.Type.Integer]]
[LclId, Arity=2, Str=DmdType LL]
c1_s10T =
\ (ds2_dxg :: GHC.Integer.Type.Integer)
(ds3_dxf :: [[GHC.Integer.Type.Integer]]) ->
let {
c2_s10Q [Dmd=Just L] :: GHC.Integer.Type.Integer
[LclId, Str=DmdType]
c2_s10Q = GHC.Integer.minusInteger lvl2_s10O ds2_dxg } in -- subtract
case GHC.Integer.eqInteger
(GHC.Integer.plusInteger lvl3_s10M (GHC.Real.^_^ ds2_dxg lvl_r11p))
-- add two squares (lvl3_s10M has been floated out)
(GHC.Real.^_^ c2_s10Q lvl_r11p)
-- ^ compared to this square
of _ {
GHC.Bool.False -> ds3_dxf;
GHC.Bool.True ->
GHC.Types.:
@ [GHC.Integer.Type.Integer]
(GHC.Types.:
@ GHC.Integer.Type.Integer
ds_dxe
(GHC.Types.:
@ GHC.Integer.Type.Integer
ds2_dxg
(GHC.Types.:
@ GHC.Integer.Type.Integer
c2_s10Q
(GHC.Types.[] @ GHC.Integer.Type.Integer))))
ds3_dxf
} } in
that is, a subtraction (minusInteger) and an equality (eqInteger), as well as two squares (^_^), are being performed within the critical section of our loop (represented by the body of a helper function), whereas the same helper function in C.core contains fewer computations (and if we dug further, we'd see that this is because GHC is unable to determine whether it is safe to float these calculations out in the optimization pass). This matches our earlier analysis head-on: we can see that the constraint (b >= a) is indeed present in A, but unlike in C, we have been unable to float the majority of the redundant calculation outside the loop.
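To illustrate what let-floating buys at the source level, here is the transformation done by hand (a sketch, hoisting the square of a out of the inner loop; this is the moral equivalent of the saving B and C enjoy, not GHC's actual mechanism):
-- Manual let-floating (sketch): aSq is computed once per a instead of
-- once per (a, b) pair.
triplets :: [[Integer]]
triplets =
  [ [a, b, c]
  | a <- [1 .. 1000]
  , let aSq = a * a          -- floated out of the loop over b
  , b <- [a .. 1000]
  , let c = 1000 - a - b
  , aSq + b * b == c * c
  ]

main :: IO ()
main = print . product . head $ triplets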
To confirm, let's arbitrarily increase the bounds of the loop involved (for the sake of demonstration), say, to [1..10000]. We should expect A's runtime behaviour to be asymptotically close to C's, just as we expect B to be left in the dust.
➜ time ./A
./A 0.37s user 0.01s system 74% cpu 0.502 total
➜ time ./B
./B 3.21s user 0.02s system 99% cpu 3.246 total
➜ time ./C
./C 0.33s user 0.01s system 99% cpu 0.343 total
and what do you know, it's just as we expect! Your initial bounds were simply too small for any interesting performance characteristics to shine through (whatever theory may tell you, constant overhead does matter in practice). Another way of looking at this result is that our initial intuition about A matching C's performance was actually more accurate than it first appeared.
Of course, all of this may be overkill for the code sample at hand, but this sort of analysis can be very useful in resource-constrained environments.