Practical deforestation for fixed-point structures -- introducing tungsten
This post introduces tungsten
, a minimalist library that provides deforestation for free to any structure expressed as a fixed-point. It is proved using free theorems and tested using inspection-testing
.
Introduction
I worked earlier this year on bringing deforestation specifically to algebraic graphs, recounted in this past blog post.
The whole process was basically copy-pasting things from the GHC.Base
module that works on lists and adapting them to work on graphs.
As my first-year professor once told me:
Never copy-paste code. If you copy-paste a line of code, you are either introducing a bug or missing an opportunity to generalize your program.
So, as I was not introducing a bug, it seems that I was missing an opportunity to generalize these rules. The theory already showed that this was possible but there was no (known to me) usable implementation. As the title suggests, I came to writing tungsten
, a library (not yet on Hackage) that generalizes the "foldr/build"
rule to every structure expressed as a fixed-point.
This post will introduce (quickly) the three needed concepts (namely deforestation, fixed-point structures, and recursion schemes) and how to cook them to obtain cool results. If you are in a hurry and familiar with these concepts, jump to the core section.
Deforestation
The concept of deforestation was introduced by Wadler in his paper “Deforestation: transforming programs to eliminate trees”, in Theoretical Computer Science 73, 2 (1990), 231 – 248.
It is the idea that, when working with “good” functions on recursive structures, one can often optimize their composition to avoid building intermediate structures.
A typical example on lists is the expression:
map f
will destruct the original list and produce a new one, which will be destructed right away by the outer map
. Knowing the definition of map
, it is easy to see that the above expression is equivalent to a more efficient one:
that go through the list only once.
Deforestation is implemented in GHC for functions working on lists using the rewrite rules system (that is, the above optimization is made automatically by the compiler; see this previous blog post).
However, the rewriting operation we made in this example do not depends on the list structure: we just used the fact that we knew how the list produced by map
is made. This concept can be simply generalized to any recursive structure (as theoretically explained by Johann in “A Generalization of Short-Cut Fusion and Its Correctness Proof”, Higher-Order and Symbolic Computation (2002) 15: 273).
Currently, if one wants to add deforestation to another structure (such as algebraic graphs) then he needs to add a new set of rules. This is boring, hard to maintain and complicated for a neophyte. The idea is instead to generalize the "foldr/build"
rule to catch all these cases.
Firstly, we need a way to think about recursive structures generically: this is made using fixed-points.
The Fix operator
A useful way to think about recursive structures is using fixed-points: they are really often used in mathematics and can be used for types. All the next section was made after the already mentioned recursion-schemes package.
Let me jump right into the subject: the fixed-point operator in Haskell can be defined as:
You can find this definition in the Tungsten.Fix
module.
What is this strange thing?
The constructor Fix
is of type f (Fix f) -> Fix f
: it basically “hides” a level of application of the base type. You can think Fix f
as a type f
applied to itself indefinitely (this is exactly the usual mathematical definition of a fixed-point).
How the hell am I using this?
The simplest thing here is to take an example (everything here can be found in the Tungsten.Structure.List
module).
Linked lists can be defined as:
How is it working? The ListF
data represent the two constructors of lists: either the empty one (NilF
) or an element and something else (ConsF
), if you replace “something else” by another list, you have a linked list.
For example
So, all lists are in the fixed-point of ListF
, that is type List a = Fix (ListF a)
.
Then, we can define:
Note that being a newtype
, Fix
is optimized away at compile time: there will be no Fix
references in the code produced by the simplifier. This is due to the great work of Breitner et al. described in “Safe Zero-cost Coercions for Haskell” 2014. SIGPLAN Not. 49, 9 (Aug. 2014), 189–202.
Recursion-schemes
Deforestation is about destruction. Consequently, the next question is: how do we destruct fixed-point structures?
It is cata-strophic
foldr
is a pretty general pattern (who didn’t write a fold_tree
or a fold_my_custom_struct
?). It is so general that there is a whole theory behind this process: recursion-schemes. They were popularized by Meijer et al. in their famous “Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire”, in Hughes J. (eds) Functional Programming Languages and Computer Architecture. FPCA 1991.
I am not going to explore them deeply, there is plenty of well-made blog posts about them and a beautiful library called, well, recursion-schemes.
Particularly, functions that “destruct recursively” (such as foldr
for lists) are called catamorphisms. Having this vocabulary, the idea of deforestation can be (roughly) expressed as: if a structure is built recursively and after destructed by a catamorphism, then the intermediate structure does not need to be built.
Catamorphisms for fixed-point structures
Catamorphisms for fixed-point types can be expressed using cata
:
It is simple to use. If you have a (non recursive) structure (of type f
, for example ListF u v
) and you want to destruct its fixed-point (Fix f
, for example List u
) to produce something of type a
, you only have to give a function that performs one step of the recursion, that is a function of type f a -> a
(for lists, that is a function of type ListF u a -> a
).
One can think cata go
as just a function replacing every occurrence of Fix
by go
in a fixed-point structure.
Note that the Functor
instance is needed to perform recursion since we only know that the structure is a fixed-point, but we don’t have access to its constructors.
An example
As an example of a catamorphism, let us write the function elem :: Eq a => a -> List a -> Bool
:
The rule(s)
(Disclaimer: I have removed all the INLINE pragmas and phase control to simplify the reading. Don’t hesitate to have a look at the code to see the details.)
Ok, now, the cool (and new) part.
The main idea is that one can abstract the build of a fixed-point structure with respect to Fix
(just like the build
function of lists abstract the two lists constructors, see this previous blog post). Then, because cata go
is just replacing every occurrence of Fix
by go
in the structure, the intermediate structure can be avoided.
This is done by buildR
, which can be defined as:
The forall
guarantees that the rule is valid (see here).
Because buildR
abstracted Fix
, we can then replace Fix
by a function go
if the structure is destructed right away by cata go
.
This is exactly the purpose of the "cata/buildR"
rule:
With this single rule, deforestation is provided to every fixed-point structures.
Examples
The "foldr/map"
optimization is just a particular case of "cata/buildR"
(that is, for example, foldr f z . map g
will be optimized in foldr (f . g) z
).
Here are the definitions (note that foldr
is just cata
and map
is defined in terms of buildR
):
Then we have:
Another example
Ok this is cool, but we already had all of this for classical lists (with the "foldr/build"
rule)! Here is an example with binary trees (you can find definitions in the Tungsten.Structure.Tree
module).
Ready? Let us work with bind
for binary trees.
bind
might seems strange, but if you replace fix'
by Fix
, all is normal (remember, cata fix == id
).
Then, the composition of several bind
will be optimized using its associativity without anything to do:
With the cata/buildR
rule, dBind
will produce the same Core code as dBindR
! dBindR
is indeed faster than dBind
since the inner bind
has direct access to the result of f
without going through the original structure.
Proof
Is this optimization valid? Is it proved?
The answer is yes. You can find a more-or-less formal proof in the tungsten
repository. Warning, seq
and undefined
are not good friends here, and they can break the rule (as for "foldr/build"
).
Another rule
Now if there is no rewriting? All expressions made with buildR
might be rewritten to cata Fix x
which corresponds to a totally useless destruction/reconstruction of the structure.
That is why Tungsten.Fix
also provides another rule:
{-# RULE
-- We cannot target `Fix` since it will be optimized away
"cata/id"
cata coerce = id
#-}
The only difficult step is to see that, because Fix
is a newtype
, the only way to target it is using coerce
(from Data.Coerce
).
Tests
All of this seems theoretically ok, but does the optimization actually occurs? To test that rules are correctly firing, I use the great inspection-testing
package which allows comparing GHC’s Core code of two expressions. All examples presented in this post (and many more) are tested this way (you can see all tests here).
Conclusion
This was pretty fun to write.
Fortunately, there are many paths to explore:
- The fusion for paramorphisms (aka. generalized catamorphisms) needs more work to be usable.
- For the rules to fire, we need to INLINE pretty much everything. It may increase the size of compiled code. Maybe there is a way to avoid this.
- There used to be a
foldr2
rule for functions that perform recursion on two lists simultaneously. Is it possible to writecata2
and optimize its composition withbuildR
?
Thanks
Thanks to Andrey Mokhov who suggested the idea and proofread this article.