I’m starting a new Twitter account @FunctorFact for functional programming and category theory.

These two subjects have a lot of overlap, and some tweets will combine both, but many will be strictly about one or the other. So some content will be strictly about programming, some strictly about math, and some combining ideas from both.

The other day I spoke to Chris Toomey from thoughtbot. Chris runs Upcase, thoughtbot’s online platform for learning about Rails, test-driven development, clean code, and more. I was curious about his work with Ruby on Rails since I know little about that world. And at a little deeper level, I wanted to get his thoughts on how programming languages are used in practice, static vs dynamic, strongly typed vs weakly typed, etc.

JC: Chris, I know you do a lot of work with Ruby on Rails. What do you think of Ruby without Rails? Would you be as interested in Ruby if the Rails framework had been written in some other language?

CT: Let me back up a little bit and give you some of my background. I started out as an engineer and I used VB because it was what I had for the task at hand. Then when I decided to buckle down and become a real developer I chose Python because it seemed like the most engineering-oriented alternative. It seemed less like an enterprise language, more small and nimble. I chose Python over Ruby because of my engineering background. Python seemed more serious, while Ruby seemed more like a hipster language. Ruby sounded frivolous, but I kept hearing good things about it, especially with Rails. So like a lot of people I came to Ruby through Rails. It was the functionality and ease of use that got me hooked, but I do love Ruby as a language, the beauty and expressiveness of it. It reads more like prose than other languages. It’s designed for people rather than machines. But it’s also a very large language and hard to parse because of that. Over time though I’ve seen people abuse the looseness, the freedom in Ruby, and that’s caused me to look at stricter options like Haskell and other functional languages.

JC: I only looked at Ruby briefly, and when I saw the relative number of numerical libraries for Python and Ruby I thought “Well, looks like it’s Python for me.”

It seems like Ruby bears some resemblance to Perl, for better or worse.

CT: Absolutely. Ruby has two spiritual ancestors. One is Perl and the other is Smalltalk. I think both of those are great, and many of the things I love about Ruby come from that lineage. Perl contributed the get-things-done attitude, the looseness and terseness, the freedom to interact at any level of abstraction.

It’s kinda odd. I keep coming back to The Zen of Python. One of the things it says is that explicit is better than implicit, and I really think that’s true. And yet I work in Ruby and Rails where implicit is the name of the game. So I have some cognitive dissonance over that. I love Ruby on Rails, but I’m starting to look at other languages and frameworks to see if something else might fit as well.

JC: Do you have the freedom to choose what language and framework you work in? Do clients just ask for a web site, or do they dictate the technology?

CT: We have a mix. A lot of clients just want a web app, but some, especially large companies, want us to use their technology stack. So while we do a lot of Rails, we also do some Python, Haskell, etc.

JC: Do you do everything soup-to-nuts or do you have some specialization?

CT: We have three roles at thoughtbot: designer, web developer, and mobile developer. The designers might do some JavaScript, but they mostly focused on user experience, testing, and design.

JC: How do you keep everything straight? The most intimidating thing to me about web development is all the diverse tools in play: the language for your logic, JavaScript, CSS, HTML, SQL, etc.

CT: There’s definitely some of that, but we outsource some parts of the stack. We host applications on Heroku, giving them responsibility for platform management. They run on top of AWS so they handle all the scaling issues so we can focus on the code. We’ll deploy to other environments if our client insists, but our preference is to go with Heroku.

Similarly, Rails has a lot of functionality for the database layer, so we don’t write a lot of SQL by hand. We’re all knowledgeable of SQL, but we’re not DBA-level experts. We scale up on that as necessary, but we want to focus on the application.

JC: Shifting gears a little bit, how do you program differently in a dynamic language like Ruby than you would in a stricter language like C++? And is that a good thing?

CT: One thing about Ruby, and dynamic languages in general, is that testing becomes all the more critical. There are a lot of potential runtime errors you have to test for. Whereas with something like Haskell you can program a lot of your logic into the type system. Ruby lets you work more freely, but Haskell leads to more robust applications. Some of our internal software at thoughtbot is written in Haskell.

JC: I was excited about using Haskell, but when I used it on a production project I ran into a lot of frustrations that you wouldn’t anticipate from working with Haskell in the small.

CT: Haskell does seem to have a more aggressive learning curve than other languages. There’s a lot of Academia in it, and in a way that’s good. The language hasn’t compromised its vision, and it’s been able to really develop some things thoroughly. But it also has a kind of academic heaviness to it.

There’s a language out there called Elm that’s inspired by Haskell and the whole ML family of languages that compiles down to JavaScript. It presents a friendlier interface to the whole type-driven, functional way of thinking. The developers of the language have put a lot of effort into making it approachable, without having to understand comonads and functors and all that.

JC: My difficulties with Haskell weren’t the theory but things like the lack of tooling and above all the difficulty of package management.

CT: Cabal Hell.

JC: Right.

CT: My understanding is that that’s improved dramatically with new technologies like Stack. We’re scaling up internally on Haskell. That’s the next area we’d like to get into. I’ll be able to say more about that down the road.

* * *

Check out Upcase for training materials on tools like Vim, Git, Rails, and Tmux.

These have all been numerical algorithms. Insertion sort is an example of a non-numerical algorithm that could be implemented as a fold.

Insertion sort is not the fastest sorting algorithm. It takes O(n^{2}) operations to sort a list of size n while the fastest algorithms take O(n log n) operations. But insertion sort is simple, and as it works its way down a list, the portion it has processed is sorted. If you interrupt it, you have correct results given the input so far. If you interrupt something like a quicksort, you don’t have a sorted list. If you’re receiving a stream of data points and want to sort them as they come, you have to use insertion sort rather than something like quicksort.

The function to fold over a list looks like this:

f as a = [x | x <- as, x < a] ++ [a] ++ [x | x <- as, x >= a]

Given a sorted list as and a value a, return a new sorted list that has inserted the element a in the proper position. Our function f does this by joining together the list of the elements less than a, the list containing only a, and the list of elements at least as big as a.

Here’s how we could use this to alphabetize the letters in the word “example.”

foldl f [] "example"

This returns "aeelmpx".

Haskell takes our function f and an empty list of characters [] and returns a new list of characters by folding f over the list of characters making up the string "example".

You can always watch how foldl works by replacing it with scanl to see intermediate results.

Folds in functional programming are often introduced as a way to find the sum or product of items in a list. In this case the fold state has the same type as the list items. But more generally the fold state could have a different type, and this allows more interesting applications of folds. Previous posts look at using folds to update conjugate Bayesian models and numerically solve differential equations.

This post uses a fold to compute mean, variance, skewness, and kurtosis. See this earlier post for an object-oriented approach. The code below seems cryptic out of context. The object-oriented post gives references for where these algorithms are developed. The important point for this post is that we can compute mean, variance, skewness, and kurtosis all in one pass through the data even though textbook definitions appear to require at least two passes. It’s also worth noting that the functional version is less than half as much code as the object-oriented version.

(Algorithms that work in one pass through a stream of data, updating for each new input, are sometimes called “online” algorithms. This is unfortunate now that “online” has come to mean something else.)

The Haskell function moments below returns the number of samples and the mean, but does not directly return variance, skewness and kurtosis. Instead it returns moments from which these statistics can easily be calculated using the mvks function.

The foldl applies moments first to its initial value, the 5-tuple of zeros. Then it iterates over the data, taking data points one at a time and visiting each point only once, returning a new state from moments each time. Another way to say this is that after processing each data point, moments returns the 5-tuple that it would have returned if that data only consisted of the values up to that point.

For a non-numerical example of folds, see my post on sorting.

One of the most widely used numerical algorithms for solving differential equation is the 4th order Runge-Kutta method. This post shows how the Runge-Kutta method can be written naturally as a fold over the set of points where the solution is needed. These points do not need to be evenly spaced.

Given a differential equation of the form

with initial condition y(t_{0}) = y_{0}, the 4th order Runge-Kutta method advances the solution by an amount of time h by

where

The Haskell code for implementing the accumulator function for foldl looks very much like the mathematical description above. This is a nice feature of the where syntax.

rk (t, y) t' = (t', y + h*(k1 + 2.0*k2 + 2.0*k3 + k4)/6.0)
where
h = t' - t
k1 = f t y
k2 = f (t + 0.5*h) (y + 0.5*h*k1)
k3 = f (t + 0.5*h) (y + 0.5*h*k2)
k4 = f (t + 1.0*h) (y + 1.0*h*k3)

Suppose we want to solve the differential equation y ‘ = (t^{2} – y^{2}) sin(y) with initial condition y(0) = -1, and we want to approximate the solution at [0.01, 0.03, 0.04, 0.06]. We would implement the right side of the equation as

f t y = (t**2 - y**2)*sin(y)

and fold the function rk over our time steps with

foldl rk (0, -1) [0.01, 0.03, 0.04, 0.06]

This returns (0.06, -0.9527). The first part, 0.06, is no surprise since obviously we asked for the solution up to 0.06. The second part, -0.9527, is the part we’re more interested in.

If you want to see the solution at all the specified points and not just the last one, replace foldl with scanl.

As pointed out in the previous post, writing algorithms as folds separates the core of the algorithm from data access. This would allow us, for example, to change rk independently, such as using a different order Runge-Kutta method. (Which hardly anyone does. Fourth order is a kind of sweet spot.) Or we could swap out Runge-Kutta for a different ODE solver entirely just by passing a different function into the fold.

Next see how you could use a fold to compute mean, variance, skewness, and kurtosis in one pass through a stream of data using a fold.

Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.

With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.

For example, suppose some binary event that happens with probability θ where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding a 1 to the α parameter for every success and to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)

Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold, foldl. Haskell also has a right associative fold foldr that we won’t be concerned about here.)

Folds in Haskell

What does foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.

ghci> :type foldl
foldl :: (a -> b -> a) -> a -> [b] -> a

Unfortunately the type variables a and b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.

Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of foldl as follows.

foldl takes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.

Starting with GHC 7.10 the type of foldl is a little more general.

ghci> :type foldl
foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a

Instead of foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace t with []. (Haskell uses [] b and [b] as equivalent ways of indicating a list of elements of type b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.

Beta-binomial example

Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.

Now what function will we fold over our data to update our prior into a posterior?

ghci> let f params y = (fst params + y, snd params + 1 - y)

The function f takes a pair of numbers, the two beta distribution parameters, and a data point y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α> parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.

ghci> let d = [1, 1, 0, 0, 1]
ghci> foldl f (0.2, 0.8) d

The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.

To see how the foldl operates, we can run scanl instead. It works like foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.

he initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied foldl.

Normal-normal example

Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ_{0} and variance τ_{0}^{2}.

This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.

Here is the function that encapsulates our posterior distribution calculation.

f params y = ((m/v + y)*newv, newv)
where
m = fst params -- mean
v = snd params -- variance
newv = v/(v + 1)

Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].

ghci> foldl f (0, 4) [3, 5]

This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run scanl instead and see

[ (0, 4), (2,4, 0,8), (3.5555, 0.4444) ]

Inspiration

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.

Next up: ODEs

The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.

It would be hard to think of two programming languages more dissimilar than Haskell and R.

Haskell was designed to enforce programming discipline; R was designed for interactive use. Haskell emphasizes correctness; R emphasizes convenience. Haskell emphasizes computer efficiency; R emphasizes interactive user efficiency. Haskell was written to be a proving ground for programming language theorists. R was written to be a workbench for statisticians. Very different goals lead to very different languages.

When I first heard of a project to mix Haskell and R, I was a little shocked. Could it even be done? Aside from the external differences listed above, the differences in language internals are vast. I’m very impressed that the folks at Tweag I/O were able to pull this off. Their HaskellR project lets you call R from Haskell and vice versa. (It’s primarily for Haskell calling R, though you can call Haskell functions from your R code: Haskell calling R calling Haskell. It kinda hurts your brain at first.) Because the languages are so different, some things that are hard in one are easy in the other.

I used HaskellR while it was under initial development. Our project was written in Haskell, but we wanted to access R libraries. There were a few difficulties along the way, as with any project, but these were resolved and eventually it just worked.

Haskell advocates are fond of saying that a Haskell function cannot launch missiles without you knowing it. Pure functions have no side effects, so they can only do what they purport to do. In a language that does not enforce functional purity, calling a function could have arbitrary side effects, including launching missiles. But this cannot happen in Haskell.

The difference between pure functional languages and traditional imperative languages is not quite that simple in practice.

Programming with pure functions is conceptually easy but can be awkward in practice. You could just pass each function the state of the world before the call, and it returns the state of the world after the call. It’s unrealistic to pass a program’s entire state as an argument each time, so you’d like to pass just that state that you need to, and have a convenient way of doing so. You’d also like the compiler to verify that you’re only passing around a limited slice of the world. That’s where monads come in.

Suppose you want a function to compute square roots and log its calls. Your square root function would have to take two arguments: the number to find the root of, and the state of the log before the function call. It would also return two arguments: the square root, and the updated log. This is a pain, and it makes function composition difficult.

Monads provide a sort of side-band for passing state around, things like our function call log. You’re still passing around the log, but you can do it implicitly using monads. This makes it easier to call and compose two functions that do logging. It also lets the compiler check that you’re passing around a log but not arbitrary state. A function that updates a log, for example, can effect the state of the log, but it can’t do anything else. It can’t launch missiles.

Once monads get large and complicated, it’s hard to know what side effects they hide. Maybe they can launch missiles after all. You can only be sure by studying the source code. Now how do you know that calling a C function, for example, doesn’t launch missiles? You study the source code. In that sense Haskell and C aren’t entirely different.

The Haskell compiler does give you assurances that a C compiler does not. But ultimately you have to study source code to know what a function does and does not do.

Sweave and Pweave are programs that let you embed R and Python code respectively into LaTeX files. You can display the source code, the result of running the code, or both.

lhs2TeX is roughly the Haskell analog of Sweave and Pweave. This post takes the sample code I wrote for Sweave and Pweave before and gives a lhs2TeX counterpart.

\documentclass{article}
%include polycode.fmt
%options ghci
\long\def\ignore#1{}
\begin{document}
Invisible code that sets the value of the variable $a$.
\ignore{
\begin{code}
a = 3.14
\end{code}
}
Visible code that sets $b$ and squares it.
(There doesn't seem to be a way to display the result of a block of code directly.
Seems you have to save the result and display it explicitly in an eval statement.)
\begin{code}
b = 3.15
c = b*b
\end{code}
$b^2$ = \eval{c}
Calling Haskell inline: $\sqrt{2} = \eval{sqrt 2}$
Recalling the variable $a$ set above: $a$ = \eval{a}.
\end{document}

If you save this code to a file foo.lhs, you can run

lhs2TeX -o foo.tex foo.lhs

to create a LaTeX file foo.tex which you could then compile with pdflatex.

One gotcha that I ran into is that your .lhs file must contain at least one code block, though the code block may be empty. You cannot just have code in \eval statements.

Unlike R and Python, the Haskell language itself has a notion of literate programming. Haskell specifies a format for literate comments. lhs2TeX is a popular tool for processing literate Haskell files but not the only one.

People often dislike static type systems because they’ve only met weak ones. A weak or not very expressive type system gets in your way all the time. It prevents you from writing functions you want to write that you know are fine. … The solution is not to abandon the type system but to make the type system more expressive.

In particular, he mentions Haskell’s polymorphic types and type inference as ways to make strong static typing convenient to use.

Fortran superficially looks like mathematics. Its name comes from “FORmula TRANslation,” and the language does provide a fairly straight-forward way to turn formulas into code. But the similarity to mathematics ends there at the lowest level of abstraction.

Haskell, on the other hand, is more deeply mathematical. Fortran resembles math in the small, but Haskell resembles math in the large. Haskell has mathematical structures at a higher level of abstraction than just formulas. As a recent article put it

While Fortran provides a comfortable translation of mathematical formulas, Haskell code begins to resemble mathematics itself.

On its surface, Perl is one of the least English-like programming languages. It is often criticized as looking like “line noise” because of its heavy use of operators. By contrast, Python has been called “executable pseudocode” because the source is easier to read, particularly for novice programmers. And yet at a deeper level, Perl is more English-like than other programming languages such as Python.

Larry Wall explains in Natural Language Principles in Perl that he designed Perl to incorporate features of spoken language. For example, Perl has analogs of articles and pronouns. (Larry Wall studied both linguistics and computer science in college.) Opinions differ on how well his experiment with incorporating natural language features into programming languages has worked out, but it was an interesting idea.

Haskell uses a lot of ideas from category theory, but the correspondence between Haskell and category theory can be a little hard to see at times.

One difficulty is that although Haskell articles use terms like functor and monad from category theory, they seldom actually talk about categories per se. If we’ve got functors, where are the categories? (This reminds me of Darth Vader asking “If this is a consular ship, where is the ambassador?”)

In Haskell literature, everything implicitly lives Hask, the category of Haskell types, or in some subcategory of Hask. This means that the category itself is not the focus of attention. In category theory, functors often operate between very different classes of objects, such as topological spaces and their fundamental groups, and so it’s more important to state what category something lives in.

Another potential stumbling block is to think of Haskell types as categories and values as objects. That would be reasonable, since in computer science an “object” is an instance of a type. But the right correspondence is to think of Haskell types as categorical objects. Instances of types are below the level of abstraction we’re working at. This is analogous to how category theory treats objects as black boxes with no way to talk about what’s inside.

Finally, Haskell monads look a little different from categorical monads. Haskell’s return corresponds directly to unit, usually written as η, in category theory. But Haskell monads have a bind operator >>= while mathematical monads have a join operator μ. These are not equivalent, though you can implement each in terms of the other:

join :: Monad m => m (m a) -> m a
join x = x >>= id
(>>=) :: Monad m => m a -> (a -> m b) -> m b
x >>= f = join (fmap f x)

To read more along these lines, see the Wikibooks article on Haskell and Category theory.

Update: Stephen Diehl suggested I mention the differences between the idealized category Hask and the implementation of the Haskell language. These are discussed here.

In the context of programming languages, “magic” is often a pejorative term for code that does something other than what it appears to do.

Programmers seem to have a love/hate relationship with magic. Even people who say that don’t like magic (e.g. because it’s hard to debug) end up using it. The Haskell community prides itself on having a transparent language with no magic, and yet monads are slightly magical. The whole purpose of a monad is to hide explicit data flow, though in a principled way. Haskell’s do notation is more magical, and templates are even more magical still. (However, I do hear some Haskellers express disdain for templates.)

People who like magic tend to use the word “automagic” instead. It means about the same thing as “magic” but with a positive connotation.

To conclude with a couple sweeping generalizations, magic fans tend to be tool-oriented (such as Microsoft developers) while magic detractors tend to be language-oriented (such as Haskell developers ).

Update: Someone asked me on Twitter about the difference between abstraction and magic. I’d say abstraction hides details, but magic is actively misleading or ironic.

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

Pure functional languages have this advantage: all flow of data is made explicit. And this disadvantage: sometimes it is painfully explicit.

That’s the problem monads solve: they let you leave implicit some of the repetitive code otherwise required by functional programming. That simple but critical point left out of many monad tutorials.

Dan Piponi wrote a blog post You Could Have Invented Monads! (And Maybe You Already Have) very much in the spirit of Wadler’s paper. He starts with the example of adding logging to a set of functions. You could have every function return not just the return value that you’re interested in but also the updated state of the log. This quickly gets messy. For example, suppose your basic math functions write to an error log if you call them with illegal arguments. That means your square root function, for example, has to take as input not just a real number but also the state of the error log before it was called. Monads give you a way of effectively doing the same thing, but conceptually separating the code that takes square roots from the code that maintains the error log.

… we can ask Coq to “extract,” from a Definition, a program in some other, more conventional, programming language (OCaml, Scheme, or Haskell) with a high-performance compiler.

Most programmers would hardly consider OCaml, Scheme, or Haskell “conventional” programming languages, but they are conventional relative to Coq. As the authors said, these languages are “more conventional,” not “conventional.”

I don’t mean to imply anything negative about OCaml, Scheme, or Haskell. They have their strengths—I briefly mentioned the advantages of Haskell just yesterday—but they’re odd birds from the perspective of the large majority of programmers who work in C-like languages.