Haskell Stack Overflow_问答_开发者_运维开发者技术经验分享

I'm writing a genetic algorithm to generate the string "helloworld". But the evolve function creates a stack overflow when n is 10,000 or more.

module Genetics where

import Data.List (sortBy)
import Random (randomRIO)
import Control.Monad (foldM)

class Gene g where
    -- How ideal is the gene from 0.0 to 1.0?
    fitness :: g -> Float

    -- How does a gene mutate?
    mutate :: g -> IO g

    -- How many species will be explored?
    species :: [g] -> Int

orderFitness :: (Gene g) => [g] -> [g]
orderFitness = reverse . sortBy (\a b -> compare (fitness a) (fitness b))

compete :: (Gene g) => [g] -> IO [g]
compete pool = do
    let s = species pool
    variants <- (mapM (mapM mutate) . map (replicate s)) pool
    let pool' = (map head . map orderFitness) variants
    return pool'

evolve :: (Gene g) => Int -> [g] -> IO [g]
evolve 0 pool = return pool
evolve n pool = do
    pool' <- compete pool
    evolve (n - 1) pool'

With species pool = 8, a pool of 8 genes replicates to 8 groups. Each group mutates, and the fittest of each group is selected for further evolution 开发者_开发知识库(back to 8 genes).

GitHub

If you're interested in performance, I'd use a fast random number generator, such as:

The mersenne-random package

Secondly, compete looks very suspicious, as it is entirely lazy, despite building some potentially large structures. Try rewriting it to be a bit stricter, using the deepseq hammer:

import Control.DeepSeq    

compete :: (Gene g, NFData g) => [g] -> IO [g]
compete pool = do
    let s = species pool
    variants <- (mapM (mapM mutate) . map (replicate s)) pool
    let pool' = (map head . map orderFitness) variants
    pool' `deepseq` return pool'

None of this stuff needs to be in IO, though, (separate issue). Something like the Rand monad may be more appropriate.

Thanks to Don's deepseq suggestion, I was able to narrow the problem to mapM mutate which made too many thunks. The new version has mutate', which uses seq to prevent thunking.

module Genetics where

import Data.List (maximumBy)
import Random (randomRIO)

class Gene g where
    -- How ideal is the gene from 0.0 to 1.0?
    fitness :: g -> Float

    -- How does a gene mutate?
    mutate :: g -> IO g

    -- How many species will be explored in each round?
    species :: [g] -> Int

best :: (Gene g) => [g] -> g
best = maximumBy (\a b -> compare (fitness a) (fitness b))

-- Prevents stack overflow
mutate' :: (Gene g) => g -> IO g
mutate' gene = do
    gene' <- mutate gene
    gene' `seq` return gene'

drift :: (Gene g) => [[g]] -> IO [[g]]
drift = mapM (mapM mutate')

compete :: (Gene g) => [g] -> IO [g]
compete pool = do
    let islands = map (replicate (species pool)) pool
    islands' <- drift islands
    let representatives = map best islands'
    return representatives

evolve :: (Gene g) => Int -> [g] -> IO [g]
evolve 0 pool = return pool
evolve n pool = compete pool >>= evolve (n - 1)

GitHub

Instead of using (map head . map orderFitness) where orderFitness is a sortBy you could use maximumBy and a single map. That doesn't save too much (since you are going from an O(n log n) to O(n) and might be getting another factor of two from eliminating the double map), but is at the least somewhat simpler and more efficient. You would also get rid of a call to reverse.

I doubt this fixes the problem without a deepseq, but it should be an improvement none the less.

Edit: if the standard library and GHC were perfect, then head . sortBy would produce identical code to maximumBy and map head . map sortBy would produce identical code to map (head . sortBy) sadly neither of these things is likely to be true in practice. sortBy will tend to do a bunch of extra memory allocation because it is a divide and conquer algorithm. Combining maps is an optimization you sometimes get, but should not count on.

More importantly, using maximumBy is more declarative. It is easier to see what the code does and how long it will take. It should also be easier to take advantage of in optimizations, because we know what the goal is, not just how we get it.