I have been using some GHC pragmas to optimize alga, and I must admit that they are very powerful. On the wise counsel of one of my GSoC mentors, A. Mokhov, here is a little summary of what I discovered, in the hope that it can help others.

GHC is the employee you need

First of all, it is important to understand the power of GHC.

As a Haskell compiler, GHC is already great, but it is greater than that because it can optimize your code.

For example, let us consider times2Times2List, which multiplies every element of a list of Int by 2, twice:

times2Times2List :: [Int] -> [Int]
times2Times2List = map (2*) . map (2*)

Obviously, this definition is equivalent to

times4List :: [Int] -> [Int]
times4List = map (4*)

The second definition should be faster, because it traverses the list once, while the first traverses it twice.

Let us try with a very simple code to benchmark this, using Criterion:

import Criterion.Main

main :: IO ()
main = defaultMain
  [ bench "times2Times2List" $ nf times2Times2List [1..10000]
  , bench "times4List" $ nf times4List [1..10000]
  ]

times2Times2List :: [Int] -> [Int]
times2Times2List = map (2*) . map (2*)

times4List :: [Int] -> [Int]
times4List = map (4*)

If we compile the code without optimization, it produces:

benchmarking times2Times2List
time                 726.9 μs   (707.4 μs .. 749.2 μs)
[...]
benchmarking times4List
time                 397.2 μs   (396.0 μs .. 398.5 μs)
[...]

So as expected, the first is slower.

Now if we recompile the same file, this time asking GHC to optimize our code (with -O):

benchmarking times2Times2List
time                 215.0 μs   (214.9 μs .. 215.2 μs)
[...]
benchmarking times4List
time                 212.2 μs   (211.9 μs .. 212.6 μs)
[...]

So GHC, among other optimizations, merged the two maps.

GHC’s RULEs pragmas

But how did GHC figure out how to combine the two maps? With a cool pragma: rewrite rules. It is very simple and yet very powerful.

Let us see a rule that could have optimized the first two examples (this is not the one GHC actually uses — the real RULES, found in GHC.Base, are more involved — but we will assume it for the sake of simplicity).

{-# RULES
  "map/map" forall f g xs. map f (map g xs) = map (f.g) xs
 #-}

Basically, this rule is telling GHC to change map f (map g xs) to map (f.g) xs when types match.
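As an experiment, here is a small self-contained module with a rewrite rule of our own. Both dropTwice and the "drop/drop" rule are made up for this example, and the rule is only valid for non-negative counts; compiling with -O -ddump-rule-firings will show whether it fires.

```haskell
module Main where

-- A helper that drops a prefix twice; with -O, GHC inlines (.) and our
-- rule below rewrites the two drops into one.
dropTwice :: Int -> Int -> [a] -> [a]
dropTwice n m = drop n . drop m

-- Our own rule: two drops collapse into one.
-- Caution: this only holds when both counts are non-negative.
{-# RULES
  "drop/drop" forall n m xs. drop n (drop m xs) = drop (n + m) xs
 #-}

main :: IO ()
main = print (dropTwice 2 3 [1 .. 10 :: Int])  -- prints [6,7,8,9,10]
```

The program behaves identically with or without the rule; the rule only changes how much work is done per traversal.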

Everything is important

Those little words “when types match” lead to something very cool: you can specialize a function to a specific type.

Imagine you have a function that internally deals with Data.Set to count the unique elements of a list:

import qualified Data.Set as Set

countUniques :: Ord a => [a] -> Int
countUniques = Set.size . Set.fromList

It is well known that Data.IntSet is faster than Data.Set when the elements are Ints. So why not use IntSet when possible? And, even better, can we make the change “behind the user’s back”, so that the user does not have to call the specialized version by hand?

module Test where

import qualified Data.Set   as Set
import qualified Data.IntSet as IntSet

countUniques :: Ord a => [a] -> Int
countUniques = Set.size . Set.fromList
{-# RULES
  "countUniques/Int" countUniques = countUniquesInt
  #-}
{-# INLINE [1] countUniques #-}

countUniquesInt :: [Int] -> Int
countUniquesInt = IntSet.size . IntSet.fromList

Now you are telling GHC to replace countUniques by the faster countUniquesInt when possible, even in the user’s code (do not pay attention to the INLINE pragma; it is discussed below).

In two other files, here is some benchmarking code:

  • Alone.hs
module Alone where

import qualified Data.Set as Set

countUniquesP :: Ord a => [a] -> Int
countUniquesP = Set.size . Set.fromList
  • Main.hs
import Test
import Alone
import Criterion.Main

aListOfInt :: [Int]
aListOfInt = [1..10000] ++ [1..1000]

main :: IO ()
main = defaultMain
        [ bench "countUniques" $ nf countUniques aListOfInt
        , bench "countUniquesP" $ nf countUniquesP aListOfInt]

Compiling with ghc Main.hs -O and running it gives:

benchmarking countUniques
time                 530.9 μs   (528.6 μs .. 533.9 μs)
[...]

benchmarking countUniquesP
time                 836.3 μs   (834.0 μs .. 838.2 μs)
[...]

Using the rewrite rule, we get more than a 50% speedup!

Try removing the INLINE line from the previous example and rebuilding with ghc Main.hs -O. GHC will warn that the rule may never fire and suggest adding an INLINE pragma, and performance-wise:

benchmarking countUniques
time                 880.6 μs   (855.6 μs .. 907.1 μs)
[...]

benchmarking countUniquesP
time                 885.0 μs   (858.8 μs .. 910.2 μs)
[...]

The countUniques/Int rule has not fired!

So what did the INLINE pragma do?

INLINE, please

Now let us say that you are using a strange library which provides:

times2List :: [Int] -> [Int]
times2List = map (2*)

Then one can write:

times4List :: [Int] -> [Int]
times4List = map (2*) . times2List

So does the optimization occur?

That is the pretty part: it actually does. To make such optimizations possible, GHC inlines your code. Because Haskell is pure, a call to a function can always be replaced by the function’s definition.

If GHC estimates that it is worth it, the compiler will replace a call to your function by its definition. And, you guessed it, GHC tries to apply RULES at each pass. So, conceptually, GHC translates:

times4List :: [Int] -> [Int]
times4List = map (2*) . times2List

into

times4List :: [Int] -> [Int]
times4List = map (2*) . map (2*)

and applies the previous rule:

times4List :: [Int] -> [Int]
times4List = map (4*)

So GHC does this pretty well by itself, but you may sometimes want to override its default behavior (in practice, it almost always inlines small functions, but is more reluctant to inline big ones). That is what the INLINE pragma is made for (along with its counterpart NOINLINE).

Adding

{-# INLINE myComplicatedFunc #-}

will make GHC very keen to inline myComplicatedFunc.

Inlining phase

You can even specify the phase in which you want the inlining to be done. Currently, GHC runs 3 phases (numbered from 2 down to 0) in which it tries to inline functions and apply rewrite rules. The syntax is {-# INLINE [k] myComplicatedFunc #-}, which, quoting the GHC manual, means:

do not inline myComplicatedFunc until phase k, but from phase k onwards be very keen to inline it.

Why do we care? Because of RULES! If a function is inlined too early in the process, related RULES might not fire. So it is always a good idea to add an INLINE pragma with phase control when using rules, to ensure that nothing happens too early and prevents your super rule from firing!

So what happened to countUniques?

Because its definition is very simple, GHC inlined it, which prevented the RULE from firing!

Adding {-# INLINE [1] countUniques #-} ensures that GHC will not inline countUniques too early, and thus that the RULE will fire.

Why isn’t everything inlined?

Because it is sometimes impossible (for a recursive function, for example), sometimes useless (transforming map f xs into map (\x -> fbody x) xs does not help), and it costs compile time. Moreover, inlining everything when it is not needed can lead to an enormous amount of generated code.

SPECIALIZE yourself

One other pragma started out as a plain RULE but is now a pragma in its own right: SPECIALIZE.

When you have a class constraint in a function, for example with:

plusTwoRec :: Num a => [a] -> [a]
plusTwoRec [] = []
plusTwoRec (x:xs) = x+2:plusTwoRec xs

(It is basically map (+2), but we will not use that definition, otherwise GHC will inline it too quickly.)

The function produced by GHC contains an extra argument: a dictionary carrying information about the class methods (here, the Num class). Adding an argument, and having to look methods up through it, takes time. So you can specialize the function using the SPECIALIZE pragma.
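To picture what that dictionary looks like, here is a hand-rolled sketch. NumDict and its fields are made-up names — a simplified model of GHC's internal representation, not the real thing:

```haskell
module Main where

-- A hand-rolled "dictionary": a record of the class methods we need.
data NumDict a = NumDict
  { dictPlus        :: a -> a -> a
  , dictFromInteger :: Integer -> a
  }

-- Roughly what the Num constraint desugars to: an extra dictionary
-- argument, consulted on every use of (+) and of the literal 2.
plusTwoRecDict :: NumDict a -> [a] -> [a]
plusTwoRecDict _ []     = []
plusTwoRecDict d (x:xs) =
  dictPlus d x (dictFromInteger d 2) : plusTwoRecDict d xs

main :: IO ()
main = print (plusTwoRecDict (NumDict (+) fromInteger) [1, 2, 3 :: Int])
-- prints [3,4,5]
```

Specialization to Int removes the dictionary argument entirely, so the (+) used is the concrete Int addition rather than a lookup.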

{-# SPECIALIZE plusTwoRec :: [Int] -> [Int] #-}

This will create another function with the same code as the original plusTwoRec but with the specialized type (and thus without the dictionary), together with a rule replacing the generic function by the specialized one when the types match.
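Conceptually, the pragma behaves as if we had written the specialized copy and the rule by hand. Here is a sketch of that expansion, where plusTwoRecInt is a hypothetical name of our own, not something GHC exposes:

```haskell
module Main where

plusTwoRec :: Num a => [a] -> [a]
plusTwoRec []     = []
plusTwoRec (x:xs) = x + 2 : plusTwoRec xs
-- Do not inline before phase 1, so the rule below has a chance to fire.
{-# INLINE [1] plusTwoRec #-}

-- The specialized copy: same code, concrete type, no dictionary.
plusTwoRecInt :: [Int] -> [Int]
plusTwoRecInt []     = []
plusTwoRecInt (x:xs) = x + 2 : plusTwoRecInt xs

-- The rule: swap in the specialized version when the types match.
{-# RULES
  "plusTwoRec/Int" plusTwoRec = plusTwoRecInt
 #-}

main :: IO ()
main = print (plusTwoRec [1, 2, 3 :: Int])  -- prints [3,4,5]
```

Either way the observable result is the same; only the dictionary overhead disappears in the specialized path.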

To benchmark all of this, we will use three files:

  • Alone.hs
module Alone where

plusTwoRecP :: Num a => [a] -> [a]
plusTwoRecP [] = []
plusTwoRecP (x:xs) = x+2:plusTwoRecP xs
  • Todo.hs
module Todo where

{-# SPECIALIZE plusTwoRec :: [Int] -> [Int] #-}
plusTwoRec :: Num a => [a] -> [a]
plusTwoRec [] = []
plusTwoRec (x:xs) = x+2:plusTwoRec xs
  • Main.hs
import Todo
import Alone
import Criterion.Main

aListOfInt :: [Int]
aListOfInt = [1..10000]

main :: IO ()
main = defaultMain
  [ bench "plusTwoRec" $ nf plusTwoRec aListOfInt
  , bench "plusTwoRecP" $ nf plusTwoRecP aListOfInt
  ]

(Merging Alone.hs and Todo.hs prevents the rule from firing in GHC 8.4.3; this should be corrected in a future release (see https://ghc.haskell.org/trac/ghc/ticket/15445).)

Compiling with ghc Main.hs -O and running it:

benchmarking plusTwoRec
time                 187.3 μs   (184.6 μs .. 190.8 μs)
[...]

benchmarking plusTwoRecP
time                 396.0 μs   (393.9 μs .. 398.3 μs)
[...]

Compiling with ghc Main.hs -O -fno-enable-rewrite-rules and running it:

benchmarking plusTwoRec
time                 396.3 μs   (394.4 μs .. 398.4 μs)
[...]

benchmarking plusTwoRecP
time                 395.3 μs   (393.8 μs .. 397.7 μs)
[...]

Specialization is a great way to optimize a polymorphic library when you know that it will be used mostly with one type.

Some examples of specialization can be found in GHC.Base itself.

Why isn’t everything specialized?

When you deal with the outside world, you cannot tell in advance which types your users will need; moreover, each specialized function takes space, and specializing too much can produce a very big library.

Inspect the core

But how can you figure out what GHC really does? You can inspect the code produced after the inlining/rules passes by compiling with -ddump-simpl. It will dump a lot of code, not always very readable, but useful for comparison.

Conclusion

GHC is like your grandpa’s old car. When you use it, it goes well enough, but your grandpa could make miracles with that old thing! How? Only because he knew how to use it properly :) .

Note: this article presents a very (very) superficial view of these pragmas. Please don’t hesitate to check the related chapter of the GHC manual, or to comment if you find something strange!