I tried to solve Duplicate each element in a Haskell List task and make it as a full program, that write the list to standard output
Here is my solution
main :: IO () main = getList >>= return . doubleEachItem >>= putStrLn . show getList = return [1,3,5] doubleEachItem :: [Int] -> [Int] doubleEachItem = foldr (++) [] . map (take 2 . repeat)
But when I try to process really long list
getList = return . take 10000000000 $ repeat 15
the program terminated with out of memory error.
The question is: how I can improve the program, so it would be able to process lists of any size?
Edit:
I think program crashed because I run it with runghc
command. In that case ghc consume about 3Gbytes of memory and was killed by operating system. The output file after crash was only about 0.6开发者_开发知识库GBytes. The reason of such behavior is unclear for me.
When I compile the program to native executable ghc haskell03.hs -o haskell03
and execute it with redirection to a file ./haskell03 >out03.txt
it perfectly runs in constant memory space and produces output file with speed about 20MBytes/sec. When the program finished the output file takes 57GBytes. Total execution time was 47 minutes.
Oh! I think I know what's going on. This is subtle.
runghc
runs in interpreted mode, as if you had loaded an run the file through ghci
. getList
is a top-level binding, so its value (and all values it references) are cached. That means that GHC was trying to cache your entire ten billion element array.
You might think that if you inlined the list it would work:
main = putStrLn . show . doubleEachElement $ [1..10^10]
But not so, because main
is a top-level binding as well, and it references [1..10^10]
, so the giant unfolded version of that list will be cached there as well.
The general rule is, when something doesn't depend on the input, it can be cached. So you can fix this by making it appear to depend on the input. In interpreted mode, GHC is not very smart about this, so it's fairly easy to trick it:
main = do
n <- return (10^10)
putStrLn . show . doubleEachElement $ [1..n]
This would work with your getList
as well, if getList
took a parameter instead of hard-coding 10^10
.
Fortunately, in compiled mode, GHC looks harder at the uses of these top-level symbols, so it can see that the list is only used once, and thus can begin garbage collecting its heads as it is output.
This problem does not come up in the real world very often, because programs often depend heavily on their input, so there aren't any top level big constants like this. But yeah, it can be a pain when GHC tries to cache something you don't want it to.
Your code is perfectly lazy, it doesn't need any optimization. What is probably going on is that all the output is on a single line, which is being buffered. You can fix this by turning off buffering:
import System.IO
main = do
hSetBuffering stdout NoBuffering
getList >>= ...
Or by, say, printing each number on its own line
main = getList >>= return . doubleEachItem >>= mapM_ print
This the application of answer of luqui to original program from the question with minimum modification
main :: IO () main = return 1 >>= getList >>= return . doubleEachElement >>= putStrLn . show getList doNotCache = return . take (10^10) $ repeat 15 doubleEachElement :: [Int] -> [Int] doubleEachElement = foldr (++) [] . map (take 2 . repeat)
The changes are emphasized with bold.
As luqui has shown, getList was cached because it was constant expression. But we can make haskell to believe that it is not constant simply by providing dummy argument, dependent on input.
Your doubleEachItem implementation via foldr
require whole list constructed in memory, because foldr
beginning work from end of list. To avoid this, replace foldr
to foldl'
or rewrite function as:
doubleEachItem = concatMap (replicate 2)
精彩评论