This post introduces `tungsten`, a minimalist library that provides deforestation for free to any structure expressed as a fixed-point. It is proved using free theorems and tested using `inspection-testing`.

## Introduction

I worked earlier this year on bringing deforestation specifically to algebraic graphs, recounted in this past blog post. The whole process was basically copy-pasting things from the `GHC.Base` module that works on lists and adapting them to work on graphs.

As my first-year professor once told me:

Never copy-paste code. If you copy-paste a line of code, you are either introducing a bug or missing an opportunity to generalize your program.

So, as I was not introducing a bug, it seems that I was missing an opportunity to generalize these rules. The theory already showed that this was possible but there was no (known to me) usable implementation. As the title suggests, I came to writing `tungsten`, a library (not yet on Hackage) that generalizes the `"foldr/build"` rule to every structure expressed as a fixed-point.

This post will introduce (quickly) the three needed concepts (namely deforestation, fixed-point structures, and recursion schemes) and how to cook them to obtain cool results. If you are in a hurry and familiar with these concepts, jump to the core section.

## Deforestation

The concept of deforestation was introduced by Wadler in his paper “Deforestation: transforming programs to eliminate trees”, in Theoretical Computer Science 73, 2 (1990), 231 – 248.

It is the idea that, when working with “good” functions on recursive structures, one can often optimize their composition to avoid building intermediate structures.

A typical example on lists is the expression:

`map f` will destruct the original list and produce a new one, which will be destructed right away by the outer `map`. Knowing the definition of `map`, it is easy to see that the above expression is equivalent to a more efficient one:

that go through the list only once.

Deforestation is implemented in GHC for functions working on lists using the rewrite rules system (that is, the above optimization is made automatically by the compiler; see this previous blog post).

However, the rewriting operation we made in this example do not depends on the list structure: we just used the fact that we knew how the list produced by `map` is made. This concept can be simply generalized to any recursive structure (as theoretically explained by Johann in “A Generalization of Short-Cut Fusion and Its Correctness Proof”, Higher-Order and Symbolic Computation (2002) 15: 273).

Currently, if one wants to add deforestation to another structure (such as algebraic graphs) then he needs to add a new set of rules. This is boring, hard to maintain and complicated for a neophyte. The idea is instead to generalize the `"foldr/build"` rule to catch all these cases.

Firstly, we need a way to think about recursive structures generically: this is made using fixed-points.

## The Fix operator

A useful way to think about recursive structures is using fixed-points: they are really often used in mathematics and can be used for types. All the next section was made after the already mentioned recursion-schemes package.

Let me jump right into the subject: the fixed-point operator in Haskell can be defined as:

You can find this definition in the `Tungsten.Fix` module.

#### What is this strange thing?

The constructor `Fix` is of type `f (Fix f) -> Fix f`: it basically “hides” a level of application of the base type. You can think `Fix f` as a type `f` applied to itself indefinitely (this is exactly the usual mathematical definition of a fixed-point).

#### How the hell am I using this?

The simplest thing here is to take an example (everything here can be found in the `Tungsten.Structure.List` module).

Linked lists can be defined as:

How is it working? The `ListF` data represent the two constructors of lists: either the empty one (`NilF`) or an element and something else (`ConsF`), if you replace “something else” by another list, you have a linked list.

For example

So, all lists are in the fixed-point of `ListF`, that is `type List a = Fix (ListF a)`.

Then, we can define:

Note that being a `newtype`, `Fix` is optimized away at compile time: there will be no `Fix` references in the code produced by the simplifier. This is due to the great work of Breitner et al. described in “Safe Zero-cost Coercions for Haskell” 2014. SIGPLAN Not. 49, 9 (Aug. 2014), 189–202.

## Recursion-schemes

Deforestation is about destruction. Consequently, the next question is: how do we destruct fixed-point structures?

### It is cata-strophic

`foldr` is a pretty general pattern (who didn’t write a `fold_tree` or a `fold_my_custom_struct` ?). It is so general that there is a whole theory behind this process: recursion-schemes. They were popularized by Meijer et al. in their famous “Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire”, in Hughes J. (eds) Functional Programming Languages and Computer Architecture. FPCA 1991.

I am not going to explore them deeply, there is plenty of well-made blog posts about them and a beautiful library called, well, recursion-schemes.

Particularly, functions that “destruct recursively” (such as `foldr` for lists) are called catamorphisms. Having this vocabulary, the idea of deforestation can be (roughly) expressed as: if a structure is built recursively and after destructed by a catamorphism, then the intermediate structure does not need to be built.

### Catamorphisms for fixed-point structures

Catamorphisms for fixed-point types can be expressed using `cata`:

It is simple to use. If you have a (non recursive) structure (of type `f`, for example `ListF u v`) and you want to destruct its fixed-point (`Fix f`, for example `List u`) to produce something of type `a`, you only have to give a function that performs one step of the recursion, that is a function of type `f a -> a` (for lists, that is a function of type `ListF u a -> a`). One can think `cata go` as just a function replacing every occurrence of `Fix` by `go` in a fixed-point structure.

Note that the `Functor` instance is needed to perform recursion since we only know that the structure is a fixed-point, but we don’t have access to its constructors.

#### An example

As an example of a catamorphism, let us write the function `elem :: Eq a => a -> List a -> Bool`:

## The rule(s)

(Disclaimer: I have removed all the INLINE pragmas and phase control to simplify the reading. Don’t hesitate to have a look at the code to see the details.)

Ok, now, the cool (and new) part.

The main idea is that one can abstract the build of a fixed-point structure with respect to `Fix` (just like the `build` function of lists abstract the two lists constructors, see this previous blog post). Then, because `cata go` is just replacing every occurrence of `Fix` by `go` in the structure, the intermediate structure can be avoided. This is done by `buildR`, which can be defined as:

The `forall` guarantees that the rule is valid (see here).

Because `buildR` abstracted `Fix`, we can then replace `Fix` by a function `go` if the structure is destructed right away by `cata go`. This is exactly the purpose of the `"cata/buildR"` rule:

With this single rule, deforestation is provided to every fixed-point structures.

### Examples

The `"foldr/map"` optimization is just a particular case of `"cata/buildR"` (that is, for example, `foldr f z . map g` will be optimized in `foldr (f . g) z`).

Here are the definitions (note that `foldr` is just `cata` and `map` is defined in terms of `buildR`):

Then we have:

#### Another example

Ok this is cool, but we already had all of this for classical lists (with the `"foldr/build"` rule)! Here is an example with binary trees (you can find definitions in the `Tungsten.Structure.Tree` module).

Ready? Let us work with `bind` for binary trees.

`bind` might seems strange, but if you replace `fix'` by `Fix`, all is normal (remember, `cata fix == id`).

Then, the composition of several `bind` will be optimized using its associativity without anything to do:

With the `cata/buildR` rule, `dBind` will produce the same Core code as `dBindR`! `dBindR` is indeed faster than `dBind` since the inner `bind` has direct access to the result of `f` without going through the original structure.

### Proof

Is this optimization valid? Is it proved?

The answer is yes. You can find a more-or-less formal proof in the `tungsten` repository. Warning, `seq` and `undefined` are not good friends here, and they can break the rule (as for `"foldr/build"`).

### Another rule

Now if there is no rewriting? All expressions made with `buildR` might be rewritten to `cata Fix x` which corresponds to a totally useless destruction/reconstruction of the structure.

That is why `Tungsten.Fix` also provides another rule:

``````{-# RULE
-- We cannot target `Fix` since it will be optimized away
"cata/id"
cata coerce = id
#-}
``````

The only difficult step is to see that, because `Fix` is a `newtype`, the only way to target it is using `coerce` (from `Data.Coerce`).

### Tests

All of this seems theoretically ok, but does the optimization actually occurs? To test that rules are correctly firing, I use the great `inspection-testing` package which allows comparing GHC’s Core code of two expressions. All examples presented in this post (and many more) are tested this way (you can see all tests here).

## Conclusion

This was pretty fun to write.

Fortunately, there are many paths to explore:

• The fusion for paramorphisms (aka. generalized catamorphisms) needs more work to be usable.
• For the rules to fire, we need to INLINE pretty much everything. It may increase the size of compiled code. Maybe there is a way to avoid this.
• There used to be a `foldr2` rule for functions that perform recursion on two lists simultaneously. Is it possible to write `cata2` and optimize its composition with `buildR`?