开发者

How to get make stats in constant memory

开发者 https://www.devze.com 2023-01-21 12:45 出处:网络
I have a function, which creates some random numerical results. I know, that the result will be an integer in a (small, a - b approx 50) range a, b. I want to create a function which execute the above

I have a function, which creates some random numerical results. I know, that the result will be an integer in a (small, a - b approx 50) range a, b. I want to create a function which execute the above function let's say 1000000 times and calculates, how often the each result appears. (The function takes a random generator to produce the result.) The problem is, I don't know how to do this in constant memory without hard-coding the range's length. My (bad) approach is like this:

values :: [Int]
values = doFunctionNtimes myRandom 1000000
results = map (\x ->length . filter (x==) $ values) [a..b]

Anybody an idea to do this?

Edit:

I think I explained the problem wrong, sorry for this. I have a function, which - depending开发者_StackOverflow社区 on a random gen - gives out some small int value. To make a statistic, I want to know, how often the results appear. As I want to make stats over let's say 1000000 tries, I need constant memory over the number of tries.


import qualified Data.Map as Map
import Data.List (foldl')          -- ' (to fix SO syntax highlighting)

histogram :: (Ord a) => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith' (+) x 1 m) Map.empty

The explanation for why this works and why it is superior to Travis Brown's solution is pretty technical, and will require some patience to understand fully.

If there are only finitely many values that can possibly occur in the list, then this runs in constant memory. Travis's solution has a subtle bug in which the resulting map entries will look like:

(4, 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1)

A very inefficient representation of the number 19. Only when you ask for that element in the map will the giant sum be computed. These "thunks" (delayed-evaluation expressions) will grow linearly with the size of the input.

To prevent this, we use insertWith', which applies the function strictly, that is to say it evaluates the result before it puts it in the map. So then if you insert 4 into the map above, it will evaluate the thunk and you will get a nice tidy:

(4, 20)

And another will evaluate that before adding so you will get:

(4, 21)

So now at least the values of the map are constant space.

The final thing we need to do is to change the right fold to a left fold because Map.insert is strict in its second argument. The following demonstrates the meaning of a right fold.

iw x m = Map.insertWith' (+) x 1 m    -- '

foldr iw Map.empty [1,2,1,3,2,1]
    = iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 Map.empty)))))

Using iw as a simple shorthand. Map.insert being strict in its second argument means you need to evaluate the map into which you are inserting before insert can do any work. I will use the notation { k1 -> v1, k2 -> v2, ... } as a shorthand for maps. Your sequence of evaluation looks like this:

foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)

foldr iw {} [1,2,1,3,2,1]
iw 1 (foldr iw {} [2,1,3,2,1])
iw 1 (iw 2 (foldr iw {} [1,3,2,1]))
iw 1 (iw 2 (iw 1 (foldr iw {} [3,2,1])))
iw 1 (iw 2 (iw 1 (iw 3 (foldr iw {} [2,1]))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (foldr iw {} [1])))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 (foldr iw {} []))))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 {}))))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 {1 -> 1}))))
iw 1 (iw 2 (iw 1 (iw 3 {1 -> 1, 2 -> 1})))
iw 1 (iw 2 (iw 1 {1 -> 1, 2 -> 1, 3 -> 1}))
iw 1 (iw 2 {1 -> 2, 2 -> 1, 3 -> 1})
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

So if you have a 1,000,000 element array, we have to go all the way down to the 1,000,000th element to start inserting, thus we need to keep the previous 999,999 elements in memory so we can know what to do next. A left fold solves this:

-- definition of left fold
foldl' f z xs = go z xs             -- '
    where 
    go accum [] = z
    go accum (x:xs) = accum `seq` go (f accum x) xs

foldl' (flip iw) Map.empty [1,2,1,3,2,1]  -- needed to flip arg order to appease foldl'
go {} [1,2,1,3,2,1]
go (iw 1 {}) [2,1,3,2,1]
go (iw 2 {1 -> 1}) [1,3,2,1]
go (iw 1 {1 -> 1, 2 -> 1}) [3,2,1]
go (iw 3 {1 -> 2, 2 -> 1}) [2,1]
go (iw 2 {1 -> 2, 2 -> 1, 3 -> 1}) [1]
go (iw 1 {1 -> 2, 2 -> 2, 3 -> 1}) []
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

Now we can see that, finally, if the number of entries in the map is bounded, then this runs in constant space and linear time.


So you have an unbounded number of possible results and want to count how many times each appears in constant memory. That is clearly impossible to do exactly, but the data structure called count-min sketch can be used to do a pretty good approximation. In your case, store the results in a count-min sketch while keeping track of the minimum and maximum value separately, and at the end query the count-min sketch for every integer from the minimum to maximum.


The way I usually handle this kind of problem is to keep track of counts in a map. Data.IntMap works in this case:

import qualified Data.IntMap as I

results :: [Int] -> I.IntMap Int
results = foldr (\x -> I.insertWith (+) x 1) I.empty

At this point you can ask for the endpoints of the range (I.findMin and I.findMax) or look up the count at a particular value in O(log n). It's also very easy to stick everything in an array for faster lookups.


UPDATE: See luqui's answer for a much better version of this code.


As already mentioned by Jouni, Constant memory is not possible, but this count-min sketch sounds like the bomb! (although I have not heard of it before). But what I think you might have asked for is the possibility to store this in one array and only update each frequency. This can be done in haskell with Mutable arrays. Here is an example:

main = do gen <- newStdGen
          n <- liftM (read . head) getArgs
          arr  <- (newArray (a,b) 0) :: IO (IOUArray Int Int)
          replicateM_ n $ do 
               result <- myRand
               x <- readArray arr result
               writeArray arr result (x+1)
          (getAssocs arr :: IO [(Int,Int)]) >>= print

Running the program with +RTS -s , and input 1000000 we get the output

787,874,256 bytes allocated in the heap
         364,536 bytes copied during GC
           5,984 bytes maximum residency (1 sample(s))
          17,928 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...
  Total time    0.29s  (  0.30s elapsed)
...
  %GC time       0.3%  (2.1% elapsed)
0

精彩评论

暂无评论...
验证码 换一张
取 消