Pareto and Pandas

This post muses about what it means to learn a software library. I’ll use Pandas as an example, but the post isn’t just about Pandas.

Suppose you say “I want to learn Pandas.” That implicitly assumes Pandas one thing, and in a sense it is. In another sense Pandas is hundreds of things.

At the top level, the pandas module (version 1.2.0) has 142 things inside.

    >>> import pandas as pd
    >>> len(dir(pd))
    142

The two most important things inside are the Series and DataFrame objects. They each in turn contain hundreds of things.

    >>> len(dir(pd.Series))
    434
    >>> len(dir(pd.DataFrame))
    441

That’s evidence Pandas’ diversity. But here’s evidence of it’s unity: most of the things inside these two objects have the same names.

    >>> s = set(dir(pd.Series))
    >>> d = set(dir(pd.DataFrame))
    >>> len(s.union(d))
    491
    >>> len(s - d)
    50
    >>> len(d - s)
    57

Pandas kinda has a fractal dimension, having both complexity and unity. The best way to think about it is not as one monolithic thing, or as hundreds of isolated things. It’s a coherent, but not perfectly coherent, collection of related things. This is true of all software libraries. Pandas is more coherent than most libraries because it was initially the product of one mind, that of Wes McKinney.

This has a couple implications for what it means to “learn Pandas.” Because Pandas is big, you have to explore it strategically, not exhaustively. And because Pandas is coherent, part of what it means to learn Pandas is to develop a feel for the way Pandas does things.

No one is going to learn Pandas by studying every object, every method on every object, and every argument to every method on every object. It’s too big. That’s also unnecessary.

There’s probably something like a Pareto distribution on the usefulness of features. The most commonly used features are used far, far more often than the most obscure features.

It would be interesting to do some kind of survey to see which features are actually used and how often. But I don’t think that’s practical. The easiest thing to do would be to find some large code base that heavily uses Pandas. But that’s not typical of how Pandas is used. Probably most lines of code using Pandas are scattered over millions of small scripts, much of it not in production code.

A well-designed library makes it possible to make good guesses about functionality you haven’t used. You learn the gestalt of the library. You can always look up API documentation as needed, but you can’t develop an intuition for a library just-in-time.

“Learn Pandas” is a daunting goal, and maybe an impossible goal if by “learn” you mean explore exhaustively. But “learn how to do my common tasks quickly in Pandas” and “develop a feel for how to do things in Pandas” are much smaller tasks.

Related posts

Expressiveness

Programmers like highly expressive programming languages, but programming managers do not. I wrote about this on Twitter a few months ago.

Q: Why do people like Lisp so much?

A: Because Lisp is so expressive.

Q: Why don’t teams use Lisp much?

A: Because Lisp is so expressive.

Q: Why do programmers complain about Java?

A: Because it’s not that expressive.

Q: Why do businesses use Java?

A: Because it’s not that expressive.

A highly expressive programming language offers lots of options. This can be a good thing. It makes programming more fun, and it can lead to better code. But it can also lead to more idiosyncratic code.

A large programming language like Perl allows developers to carve out language subsets that hardly overlap. A team member has to learn not only the parts of the language he understands and wants to use, but also all the parts that his colleagues might use. And those parts that he might accidentally use.

While Perl has maximal syntax, Lisp has minimal syntax. But Lisp is also very expressive, albeit in a different way. Lisp makes it very easy to extend the language via macros. While Perl is a big language, Lisp is an extensible language. This can also lead to each programmer practically having their own language.

With great expressiveness comes great responsibility. A team using a highly expressive language needs to develop conventions for how the language will be used in order to avoid fracturing into multiple de facto languages.

But what if you’re a team of one? Now you don’t need to be as concerned how other people use your language. You still may need to care somewhat. You want to be able to grab sample code online, and you may want to share code or ask others for help. It pays not to be entirely idiosyncratic, though you’re free to wander further from the mainstream.

Even when you’re working in a team, you still may have code that only you use. If your team is producing C# code, and you secretively use a Perl script to help you find things in the code, no one needs to know. On the other hand, there’s a tendency for personal code to become production code, and so personal tools in a team environment are tricky.

But if you’re truly working by yourself, you have great freedom in your choice of tools. This can take a long time to sort out when you leave a team environment to strike out on your own. You may labor under your previous restrictions for a while before realizing they’re no longer necessary. At the same time, you may choose to stick to your old tools, not because they’re optimal for your new situation, but because it’s not worth the effort to retool.

Related posts

(Regarding the last link, think myth as in Joseph Campbell, not myth as in Myth Busters.)

From shell to system

Routine computer tasks and system programming require different tools, though I’m not entirely sure why.

Many people have thought about how inconsistent shells and system programming languages are and tried to unite them. Wouldn’t it be nice to use one language for everything? But attempts to bring system languages down to the shell, or to push shell programming up to large programs, have not been very successful.

I learned Perl in college so I wouldn’t have to learn shell programming. That’s what Perl was initially designed to be: an alternative to shell scripting. Larry Wall called Perl a “distillation of Unix culture.”

Perl is the most disliked programming language according to Stack Overflow. And yet I imagine many who complain about Perl gladly use the menagerie of quirky tools that Perl was created to unify. Bash is popular while Perl is unpopular, and yet the quirkiest parts of Perl are precisely those it shares with bash.

I expect much of the frustration with Perl comes from using it as a language for writing larger programs. Perl is very terse and expressive. These features are assets for one-liners and individual use. They are liabilities for large programs and team development.

Compared to a system programming language like Java, Perl is complex, inconsistent, and unsafe. But compared to shell scripting, Perl is simple, consistent, and safe!

Related posts

Software analysis and synthesis

People who haven’t written large programs think that writing software is easy. All you have to do is break a big problem into smaller problems until you have something so small that it’s easy to program.

The problem is putting the pieces back together. If you’ve only written small programs, you haven’t had many pieces to put together. It’s harder to put the pieces together when you write a large program by yourself. It’s even harder when you work on a large program with other people.

Synthesis is harder than analysis. Or as Perdita Stevens put it, integration is harder than separation.

The image above is a screenshot from her keynote at the RC2020 conference on reversible computation.

Related post: The cost of taking things apart and putting them back together.

Short essays on programming languages

I saw a link to So You Think You Know C? by Oleksandr Kaleniuk on Hacker News and was pleasantly surprised. I expected a few comments about tricky parts of C, and found them, but there’s much more. The subtitle of the free book is And Ten More Short Essays on Programming Languages. Good reads.

This post gives a few of my reactions to the essays, my even shorter essays on Kaleniuk’s short essays.

My C

The first essay is about undefined parts of C. That essay, along with this primer on C obfuscation that I also found on Hacker News today, is enough to make anyone run screaming away from the language. And yet, in practice I don’t run into any of these pitfalls and find writing C kinda pleasant.

I have an atypical amount of freedom, and that colors my experience. I don’t maintain code that someone else has written—I paid my dues doing that years ago—and so I simply avoid using any features I don’t fully understand. And I usually have my choice of languages, so I use C only when there’s a good reason to use C.

I would expect that all these dark corners of C would be accidents waiting to happen. Even if I don’t intentionally use undefined or misleading features of the language, I could use them accidentally. And yet in practice that doesn’t seem to happen. C, or at least my personal subset of C, is safer in practice than in theory.

APL

The second essay is on APL. It seems that everyone who programs long enough eventually explores APL. I downloaded Iverson’s ACM lecture Notation as a Tool of Thought years ago and keep intending to read it. Maybe if things slow down I’ll finally get around to it. Kaleniuk said something about APL I hadn’t heard before:

[APL] didn’t originate as a computer language at all. It was proposed as a better notation for tensor algebra by Harvard mathematician Kenneth E. Iverson. It was meant to be written by hand on a blackboard to transfer mathematical ideas from one person to another.

There’s one bit of notation that Iverson introduced that I use fairly often, his indicator function notation described here. I used it a report for a client just recently where it greatly simplified the write-up. Maybe there’s something else I should borrow from Iverson.

Fortran

I last wrote Fortran during the Clinton administration and never thought I’d see it again, and yet I expect to need to use it on a project later this year. The language has modernized quite a bit since I last saw it, and I expect it won’t be that bad to use.

Apparently Fortran programmers are part of the dark matter of programmers, far more numerous than you’d expect based on visibility. Kaleniuk tells the story of a NASA programming competition in which submissions had to be written in Fortran. NASA cancelled the competition because they were overwhelmed by submissions.

Syntax

In his last essay, Kaleniuk gives some ideas for what he would do if he were to design a new language. His first point is that our syntax is arbitrarily constrained. We still use the small collection of symbols that were easy to input 50 years ago. As a result, symbols are highly overloaded. Regular expressions are a prime example of this, where the same character has to play multiple roles in various contexts.

I agree with Kaleniuk in principle that we should be able to expand our vocabulary of symbols, and yet in practice this hasn’t worked out well. It’s possible now, for example, to use λ than lambda in source code, but I never do that.

I suspect the reason we stick to the old symbols is that we’re stuck at a local maximum: small changes are not improvements. A former client had a Haskell codebase that used one non-ASCII character, a Greek or Russian letter if I remember correctly. The character was used fairly often and it did made the code slightly easier to read. But it wreaked havoc with the tool chain and eventually they removed it.

Maybe a wholehearted commitment using more symbols would be worth it; it would take no more effort to allow 100 non-ASCII characters than to allow one. For that matter, source code doesn’t even need to be limited to text files, ASCII or Unicode. But efforts along those lines have failed too. It may be another local maximum problem. A radical departure from the status quo might be worthwhile, but there’s not a way to get there incrementally. And radical departures nearly always fail because they violate Gall’s law: A complex system that works is invariably found to have evolved from a simple system that worked.

Related posts

A wrinkle in Clojure

Bob Martin recently posted a nice pair of articles, A Little Clojure and A Little More Clojure. In the first article he talks about how spare and elegant Clojure is.

In the second article he shows how to write a program to list primes using map and filter rather than if and while. He approaches the example leisurely, showing how to arrive at the solution in small steps understandable to someone new to Clojure.

The second article passes over a small wrinkle in Clojure that I’d like to say a little about. Near the top of the post we read

The filter function also takes a function and a list. (filter odd? [1 2 3 4 5]) evaluates to (1 3 5). From that I think you can tell what both the filter and the odd? functions do.

Makes sense: given a bunch of numbers, we pick out the ones that are odd. But look a little closer. We start with [1 2 3 4 5] in square brackets and end with (1 3 5) in parentheses. What’s up with that? The article doesn’t say, and rightfully so: it would derail the presentation to explain this subtlety. But it’s something that everyone new to Clojure will run into fairly quickly.

As long as we’re vague about what “a bunch of numbers” means, everything looks fine. But when we get more specific, we see that filter takes a vector of numbers and returns a list of numbers [1]. Or rather it can take a vector, as it does above; it could also take a list. There are reasons for this, explained here, but it’s puzzling if you’re new to the language.

There are a couple ways to make the filter example do what you’d expect, to either have it take a vector and return a vector, or to have it take a list and return a list. Both would have interrupted the flow of an introductory article. To take a vector and return a vector you could run

    (filterv odd? [1 2 3 4 5])

This returns the vector [1 3 5].

Notice we use the function filterv rather than filter. If the article had included this code, readers would ask “Why does filter have a ‘v’ on the end? Why isn’t it just called ‘filter’?”

To take a list and return a list you could run

    (filter odd? '(1 2 3 4 5))

This returns the list (1 3 5).

But if the article had written this, readers would ask “What is the little mark in front of (1 2 3 4 5)? Is that a typo? Why didn’t you just send it the list?” The little mark is a quote and not a typo. It tells Clojure that you are passing in a list and not making a function call.

One of the core principles of Lisp is that code and data use the same structure. Everything is a list, hence the name: LISP stands for LISt Processing. Clojure departs slightly from this by distinguishing vectors and lists, primarily for efficiency. But like all Lisps, function calls are lists, where the first element of the list is the name of the function and the remaining elements are arguments [2]. Without the quote mark in the example above, the Clojure compiler would look in vain for a function called 1 and throw an exception:

CompilerException java.lang.RuntimeException: Unable to resolve symbol: quote in this context, compiling: …

Since vectors are unambiguously containers, never used to indicate function calls, there’s no need for vectors to be quoted, which simplifies the exposition in an introductory article.

More Lisp posts

[1] The filter function actually returns a lazy sequence, displayed at the REPL as a list. Another detail one would be wise to omit from an introductory article.

[2] In Clojure function calls provide arguments in a list, but function definitions gather arguments into vectors.

Computational survivalist

survival gear

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

More computational tool posts

Professional, amateur, and something else

I opened a blog posts a while back by saying

One of the differences between amateur and professional software development is whether you’re writing software for yourself or for someone else. It’s like the difference between keeping a journal and being a journalist.

This morning I saw where someone pulled that quote and I thought about how I’m now somewhere that doesn’t fall into either category. I’m a professional, and a software developer, but not a professional software developer.

People pay me for work that may require writing software, but they usually don’t want my software per se. They want reports that may require me to write software to generate. My code is directly for my own use but indirectly for someone else. Occasionally I will have a client ask me for software, but not often.

In the last few years I’ve been asked to read software more often than I’ve been asked to write it. For example, I was an expert witness on an intellectual property case and had to read a lot of source code. But even that project fit into the pattern above: I wrote some code to help me do my review, but my client didn’t care about that code per se, only what I found with it.

Related posts

Curry-Howard-Lambek correspondence

Curry-Howard-Lambek is a set of correspondences between logic, programming, and category theory. You may have heard of the slogan proofs-as-programs or propositions-as-types. These refer to the Curry-Howard correspondence between mathematical proofs and programs. Lambek’s name is appended to the Curry-Howard correspondence to represent connections to category theory.

The term Curry-Howard isomorphism is often used but is an overstatement. Logic and programming are not isomorphic, nor is either isomorphic to category theory.

You’ll also hear the term computational trinitarianism. That may be appropriate because logic, programming, and category theory share common attributes without being identical.

The relations between logic, programming, and categories are more than analogies, but less than isomorphisms.

There are formal correspondences between parts of each theory. And aside from these rigorous connections, there is also a heuristic that something interesting in one area is likely to have an interesting counterpart in the other areas. In that sense Curry-Howard-Lambek is similar to the Langlands program relating number theory and geometry, a set of theorems and conjectures as well as a sort of philosophy.