Lazy evaluation sometimes makes it trickier to really understand how a piece of code for folks used to languages with strict semantics (as I am). Sometimes introducing strictness is necessary to avoid space leaks and to make memory allocations more predictable in certain parts of our code. The usual suggestion is to “carefully sprinkle strict evaluation” in our code; one of the classic examples of memory leak is using foldl
to sum a list of ints, with the result that instead of returning a result using constant space, it ends up taking an outrageous amount of memory before returning a result because thunks pile up (this behaviour is known as space leak). Most of the times I personally find it tricky to add strictness to a piece of Haskell code, so I’d like to share my latest experience doing that.
We’ll be using the Bloom filter implemented in chapter 26 of Real World Haskell as an example, the version contained in the book creates the filter lazily: our goal will be to create a strict version of that particular piece of code. In a nutshell, a Bloom filter is a probabilistic data structure that consists of several hash functions and a bit array whose API allows only insertion and membership querying. The latter API might return false positives with an expected error rate decided when the filter is instantiated. Here’s a function that builds a Bloom filter lazily:
-- file: BloomFilter/BloomFilter.hs
import BloomFilter.Immutable as B (IBloom, fromList)
import BloomFilter.Hash (Hashable, doubleHash)
import Data.List (genericLength)
import Data.Either (either)
mkFromList :: Hashable a => Int -> [a] -> Either String (B.IBloom a)
mkFromList errRate xs =
either Left (Right . mkBFilt) $ suggestSizing (genericLength xs) errRate
where
mkBFilt (bits, numHashes) =
B.fromList (doubleHash numHashes) bits xs
The function suggestSizing
provides the optimal size of the underlying array and the number of hashes to generate given the length of the input list and the desired rate of false positives, but it’s not important for the topic of this article. Let’s try this code out in GHCI:
ƛ :set +s -- to print timing/memory stats after each evaluation
ƛ :load BloomFilter.BloomFilter
ƛ let ebf = mkFromList 0.01 ([1..10^6] :: [Int])
ebf :: Either String (B.IBloom Int)
(0.01 secs, 4658656 bytes)
The fact that ebf
has not been fully evaluated should be clear since the evaluation took almost no time, but let’s ask GHCI for help:
ƛ :print ebf -- prints a value without forcing its computation
ebf = (_t2::Either String (B.IBloom Int))
GHCI is telling us that ebf
is a thunk _ts
of type Either String (B.IBloom Int)
. If we’re still not convinced that ebf
is not evaluated we can ask it if an element is contained in the Bloom filter:
ƛ either (const False) (1 `B.elem`) ebf
True
it :: Bool
(19.44 secs, 13818404512 bytes)
ƛ either (const False) (11 `B.elem`) ebf
True
it :: Bool
(0.01 secs, 3118248 bytes)
From the timing/memory information should be pretty clear now that the evaluation was forced when we explicitly asked for a membership test. That expected given Haskell’s non-strict semantic. If we ask GHCI to give us information about ebf
we can see that now it gives us a different answer:
ƛ :print ebf
ebf = Right
(B.IB
(_t3::Int -> [Word32])
(Data.Array.Base.UArray
(GHC.Word.W32# 0) (GHC.Word.W32# 9592954) 9592955
(_t4::ghc-prim-0.5.0.0:GHC.Prim.ByteArray#)))
Let’s not focus on the types - again not important - GHCI is telling us the value of ebf
after evaluation. We’d like to force this evaluation before the first time the Bloom filter is used, namely when it is created.
Beware GHCI
In GHCI we always need to explicitly specify type annotations for bindings that need to be forced, otherwise the interpreter will infer the most general type and won’t force the evaluation of the term. This is related to “the dreaded monomorphism restriction”.
Tools of the job
There are various ways of forcing evaluation in Haskell, the main ones are: seq
, deepseq
, rnf
(the last two can be found in the Control.DeepSeq
module and require the argument to be an instance of the NFData
type class) or the handy BangPatterns
extension, which is syntactic sugar for seq
. As a first try, let’s force the evaluation of ebf
using bang patters and see what happens:
ƛ :set -XBangPatterns
ƛ let !ebf' = mkFromList 0.01 ([1..10^6] :: [Int])
ebf' :: Either String (B.IBloom Int)
(0.34 secs, 197720920 bytes)
ƛ :print ebf'
ebf' = Right (_t5::B.IBloom Int)
That did something, specifically it evaluated ebf'
a bit so that now we already know that the construction of the Bloom filter succeeded but did we manage to instantiate it? By carefully reading the output of GHCI it should be clear that we’re not quite there yet but let’s again double check:
ƛ either (const False) (11 `B.elem`) ebf'
True
it :: Bool
(19.02 secs, 13624548640 bytes)
The membership test took still 19 seconds, as we expected. So what’s happening here? Now it’s probably a good point to introduce some terminology that will help us out understanding what’s happening and how to go forward.
NF and WHNF
A reducible expression (or redex) is an expression that can be evaluated until a value is obtained, i.e. let x = 1 + 6
is a redex since it can be evaluated to obtain let x = 5
. Let’s again double check it in GHCI:
ƛ let x = 1 + 5 :: Int
x :: Int
ƛ :print x
x = (_t6::Int)
ƛ let !x = 1 + 5 :: Int
x :: Int
ƛ :print x
x = 6
At this point x
cannot be futher evaluated and is said to be in normal (or canonical) form. Now what about an expression like Right (1 + 5)
? It should be clear that it’s not in normal form so can we just force evaluation by adding a bang pattern? Let’s see if that works:
ƛ let !x = Right (1 + 5) :: Either a Int
x :: Either a Int
ƛ :print x
x = Right (_t8::Int)
What’s happening here?! It turns out that an expression in Haskell can be in other form called weak head normal form when it’s not a redex itself and further evaluation of its sub-expressions cannot make it a redex. Right (1 + 5)
isn’t a redex (Right
is a constructor for the Either
type) and it cannot be made one if the sub-expression 1 + 5
is evaluated. Does that mean we have to unwrap the sub-expression in order for it to be evaluated? Not necessarily. We have a few options, namely forcing the evaluation of the sub-expression before we wrap it:
ƛ let !x = let !y = 1 + 5 :: Int in Right y
x :: Either a Int
ƛ :print x
x = Right 6
or levaraging some of the functions in the Control.DeepSeq
module:
ƛ let !x = let x = Right (1 + 5) :: Either a Int in x `deepseq` x
x :: Either a Int
ƛ :print x
x = Right 6
ƛ let !x = Right (1 + 5) :: Either a Int
x :: Either a Int
ƛ rnf x
()
it :: ()
ƛ :print x
x = Right 6
deepseq
is like seq
on steroids, it reduces an expression and all its sub-expressions to normal form (rnf
which stays for “reduce to normal form” does exactly the same). Again keep in mind is that in order to use these two functions the argument must be an instance of NFData
(Normal Form Data).
A more in depth explanation and a bunch of very informative links and more examples can be found in Stephen Diehl’s What I wish I knew when learning Haskell
Taking control
Now that we’are aware of all this, let’s create a strict version of our mkFromList
function and let’s call it mkFromList'
(using the convention other functions like foldr'
use). The first function we need to change is (Right . mkBFilt)
: this is equivalent to \x -> Right (mkBFilt x)
(using eta-expansion) and to \pair -> let bfilt = mkBFilt pair in Right bfilt
if we massage the lambda a bit. Here bfilt
needs to be evaluated so again the easiest thing to do is to add a bang pattern: \pair -> let !bfilt = mkBFilt pair in Right bfilt
. A quick note for about point-free style: adding strictness is a bummer in that respect. Let’s have a look at the following code
-- file: BloomFilter/BloomFilter.hs
mkFromList' errRate xs =
either Left (Right . mkFilt') $ suggestSizing (genericLength xs) errRate
where
mkBFilt' (bits, numHashes) =
let !bfilt = B.fromList (doubleHash numHashes) bits xsin bfilt
By eta-expanding (Right . mkFilt')
we obtain \pair -> Right (mkFilt' pair)
that is a function that will be evaluated lazily. Are we done yet? Almost. Let’s have a look at the type of ebf'
again: Either String (B.IBloom Int)
. What’s IBloom
(the ‘I’ stays for “immutable”)? Here’s how it’s defined:
-- file: BloomFilter/Internals.hs
data IBloom =
IB { hash :: (a -> [Word32])
, array :: UArray Word32 Bool
}
This closely reflects the definition of a Bloom filter, we have a function that returns a list of hashes for a given value and an array of bits. Keeping in mind that a constructor is also a function, we might notice that there is still something we need to force evaluation upon: the array
field. In order to do this let’s write a strict version of mkBFilt
, this time using seq
for a change:
-- file: BloomFilter/BloomFilter.hs
mkBFilt' (bits, numHashes) =
let bfilt = B.fromList (doubleHash numHashes) bits xs
in array bfilt `seq` bfilt
Equivalently, we could have pattern-matched on bfilt
and used a bang pattern on its array
field. The final version of our mkFromList'
function looks something like this:
-- file: BloomFilter/BloomFilter.hs
mkFromList' :: Hashable a => ErrorRate -> [a] -> Either String (B.IBloom a)
mkFromList' errRate xs =
either Left rightBFilt' $ suggestSizing (genericLength xs) errRate
where
rightBFilt' x = let !bfilt = mkBFilt' x in Right bfilt
mkBFilt' (bits, numHashes) =
let bfilt = B.fromList (doubleHash numHashes) bits xs
in array bfilt `seq` bfilt
Let’s test it in GHCI:
ƛ let !ebf'' = mkFromList' 0.01 ([1..10^6] :: [Int])
ebf'' :: Either String (B.IBloom Int)
(19.29 secs, 13819004104 bytes)
ƛ :print ebf''
ebf'' = Right
(B.IB
(_t1::Int -> [Word32])
(Data.Array.Base.UArray
(GHC.Word.W32# 0) (GHC.Word.W32# 9592954) 9592955
(_t2::ghc-prim-0.5.0.0:GHC.Prim.ByteArray#)))
And YES! We finally managed to fully evaluate our Bloom filter before its first use in our code.
Wrapping up
- There are multiple ways we can use to introduce strictness in Haskell code:
seq
, theBangPatterns
extension or the functions in theControl.DeepSeq
module - Using GHCI and leveraging the
:print
command and the+s
flag can help us understanding how our code is evaluated while developing - Keep in mind the difference between NF and WHNF: if we cannot manage to force evaluation of an expression it’s because some sub-expression is still in WHNF
- Carefully analyse our code to identify where strictness needs to be added