The Context

The context of this question is that I want to play around with Gene Expression Programming (GEP), a form of evolutionary algorithm, using Erlang. GEP makes use of a string based D开发者_开发知识库SL called 'Karva notation'. Karva notation is easily translated into expression parse trees, but the translation algorithm assumes an implementation having mutable objects: incomplete sub-expressions are created early-on the translation process and their own sub-expressions are filled-in later-on with values that were not known at the time they were created.

The purpose of Karva notation is that it guarantees syntactically correct expressions are created without any expensive encoding techniques or corrections of genetic code. The problem is that with a single-assignment programming language like Erlang, I have to recreate the expression tree continually as each sub expression gets filled in. This takes an inexpensive - O(n)? - update operation and converts it into one that would complete in exponential time (unless I'm mistaken). If I can't find an efficient functional algorithm to convert K-expressions into expression trees, then one of the compelling features of GEP is lost.

The Question

I appreciate that the K-expression translation problem is pretty obscure, so what I want is advice on how to convert an inherently-non-functional algorithm (alg that exploits mutable data structures) into one that does not. How do pure functional programming languages adapt many of the algorithms and data structures that were produced in the early days of computer science that depend on mutability to get the performance characteristics they need?

Carefully designed immutability avoids unecessary updating

Immutable data structures are only an efficiency problem if they're constantly changing, or you build them up the wrong way. For example, continually appending more to the end of a growing list is quadratic, whereas concatenating a list of lists is linear. If you think carefully, you can usually build up your structure in a sensible way, and lazy evaluation is your friend - hand out a promise to work it out and stop worrying.

Blindly trying to replicate an imperative algorithm can be ineffecient, but you're mistaken in your assertion that functional programming has to be asymptotically bad here.

Case study: pure functional GEP: Karva notation in linear time

I'll stick with your case study of parsing Karva notation for GEP. ( I've played with this solution more fully in this answer.)

Here's a fairly clean pure functional solution to the problem. I'll take the opportunity to name drop some good general recursion schemes along the way.

Code

(Importing Data.Tree supplies data Tree a = Node {rootLabel :: a, subForest :: Forest a} where type Forest a = [Tree a].)

import Data.Tree
import Data.Tree.Pretty -- from the pretty-tree package for visualising trees

arity :: Char -> Int
arity c 
  | c `elem` "+*-/" = 2
  | c `elem` "Q" = 1
  | otherwise = 0

A hylomorphism is the composition of an anamorphism (build up, unfoldr) and a catamorphism (combine, foldr). These terms are introduced to the FP community in the seminal paper Functional Programming with Bananas, Lenses and Barbed wire.

We're going to pull the levels out (ana/unfold) and combine them back together (cata/fold).

hylomorphism :: b -> (a -> b -> b) -> (c -> (a, c)) -> (c -> Bool) -> c -> b
hylomorphism base combine pullout stop seed = hylo seed where
 hylo s | stop s = base
        | otherwise = combine new (hylo s') 
          where (new,s') = pullout s

To pull out a level, we use the total arity from the previous level to find where to split off this new level, and pass on the total arity for this one ready for next time:

pullLevel :: (Int,String) -> (String,(Int,String))
pullLevel (n,cs) = (level,(total, cs')) where
                   (level,        cs') = splitAt n cs
                   total = sum $ map arity level

To combine a level (as a String) with the level below (that's already a Forest), we just pull off the number of trees that each character needs.

combineLevel :: String -> Forest Char -> Forest Char
combineLevel "" [] = []
combineLevel (c:cs) levelBelow = Node c subforest : combineLevel cs theRest 
      where (subforest,theRest) = splitAt (arity c) levelBelow

Now we can parse the Karva using a hylomorphism. Note that we seed it with a total arity from outside the string of 1, since there's only one node at the root level. Correspondingly we apply head to the result to get this singleton back out after the hylomorphism.

karvaToTree :: String -> Tree Char
karvaToTree cs = let
  zero (n,_) = n == 0          
    in head $ hylomorphism [] combineLevel pullLevel zero (1,cs)

Linear Time

There's no exponential blowup, nor repeated O(log(n)) lookups or expensive modifications, so we shouldn't be in too much trouble.

arity is O(1)
splitAt part is O(part)
pullLevel (part,cs) is O(part) for grab using splitAt to get level, plus O(part) for the map arity level, so O(part)
combineLevel (c:cs) is O(arity c) for the splitAt, and O(sum $ map arity cs) for the recursive call
hylomorphism [] combineLevel pullLevel zero (1,cs)
- makes a pullLevel call for each level, so the total pullLevel cost is O(sum parts) = O(n)
- makes a combineLevel call for each level, so the total combineLevel cost is O(sum $ map arity levels) = O(n), since the total arity of the entire input is bound by n for valid strings.
- makes O(#levels) calls to zero (which is O(1)), and #levels is bound by n, so that's below O(n) too
Hence karvaToTree is linear in the length of the input.

I think that puts to rest the assertion that you needed to use mutability to get a linear algorithm here.

Demo

Let's have a draw of the results (because Tree is so full of syntax it's hard to read the output!). You have to cabal install pretty-tree to get Data.Tree.Pretty.

see :: Tree Char -> IO ()
see = putStrLn.drawVerticalTree.fmap (:"")

ghci> karvaToTree "Q/a*+b-cbabaccbac"
Node {rootLabel = 'Q', subForest = [Node {rootLabel = '/', subForest = [Node {rootLabel = 'a', subForest = []},Node {rootLabel = '*', subForest = [Node {rootLabel = '+', subForest = [Node {rootLabel = '-', subForest = [Node {rootLabel = 'b', subForest = []},Node {rootLabel = 'a', subForest = []}]},Node {rootLabel = 'c', subForest = []}]},Node {rootLabel = 'b', subForest = []}]}]}]}

ghci> see $ karvaToTree "Q/a*+b-cbabaccbac"
      Q      
      |      
      /      
      |      
 ------      
/      \     
a      *     
       |     
       ----- 
      /     \
      +     b
      |      
     ----    
    /    \   
    -    c   
    |        
    --       
   /  \      
   b  a

which matches the output expected from this tutorial where I found the example:

Can I always convert mutable-only algorithms to single-assignment and still be efficient?

There isn't a single way to do this, it really has to be attempted case-by-case. I typically try to break them down into simpler operations using fold and unfold and then optimize from there. Karva decoding case is a breadth-first tree unfold as others have noted, so I started with treeUnfoldM_BF. Perhaps there are similar functions in Erlang.

If the decoding operation is unreasonably expensive, you could memoize the decoding and share/reuse subtrees... though it probably wouldn't fit into a generic tree unfolder and you'd need to write specialized function to do so. If the fitness function is slow enough, it may be fine to use a naive decoder like the one I have listed below. It will fully rebuild the tree each invocation.

import Control.Monad.State.Lazy
import Data.Tree

type MaxArity = Int
type NodeType = Char

treeify :: MaxArity -> [Char] -> Tree NodeType
treeify maxArity (x:xs) = evalState (unfoldTreeM_BF (step maxArity) x) xs
treeify _ [] = fail "empty list"

step :: MaxArity -> NodeType -> State [Char] (NodeType, [NodeType])
step maxArity node = do
  xs <- get
  -- figure out the actual child node count and use it instead of maxArity
  let (children, ys) = splitAt maxArity xs
  put ys
  return (node, children)

main :: IO ()
main = do
 let x = treeify 3 "0138513580135135135"
 putStr $ drawTree . fmap (:[]) $ x
 return ()

There are a couple of solutions when mutable state in functional programming is required.

Use a different algorithm that solves the same problem. E.g. quicksort is generally regarded as mutable and may therefore be less useful in a functional setting, but mergesort is generally better suited for a functional setting. I can't tell if this option is possible or makes sense in your case.
Even functional programming languages usually provide some way to mutate state. (This blog post seems to show how to do it in Erlang.) For some algorithms and data structures this is indeed the only available option (there's active research on the topic, I think); for example hash tables in functional programming languages are generally implemented with mutable state.

In your case, I'm not so sure immutability really leads to a performance bottleneck. You are right, the (sub)tree will be recreated on update, but the Erlang implementation will probably reuse all the subtrees that haven't changed, leading to O(log n) complexity per update instead of O(1) with mutable state. Also, the nodes of the trees won't be copied but instead the references to the nodes, which should be relatively efficient. You can read about tree updates in a functional setting in e.g. the thesis from Okasaki or in his book "Purely Functional Data Structures" based on the thesis. I'd try implementing the algorithm with an immutable data structure and switch to a mutable one if you have a performance problem.

Also see some relevant SO questions here and here.

I think I figured out how to solve your particular problem with the K trees, (the general problem is too hard :P). My solution is presented in some horrible sort of hybrid Python-like psudocode (I am very slow on my FP today) but it doesn't change a node after you create one (the trick is building the tree bottom-up)

First, we need to find which nodes belong to which level:

levels currsize nodes = 
    this_level , rest = take currsize from nodes, whats left
    next_size = sum of the arities of the nodes
    return [this_level | levels next_size rest]
(initial currsize is 1)

So in the +/*abcd, example, this should give you [+, /*, abcd]. Now you can convert this into a tree bottom up:

curr_trees = last level
for level in reverse(levels except the last)
    next_trees = []
    for root in level:
        n = arity of root
        trees, curr_trees = take n from curr_trees, whats left
        next_trees.append( Node(root, trees) )
    curr_trees = next_trees

curr_trees should be a list with the single root node now.

I am pretty sure we can convert this into single assignment Erlang/Haskell very easily now.