Keeping data and code together with org-mode

With org-mode you can keep data, code, and documentation in one file.

Suppose you have an org-mode file containing the following table.

    #+NAME: mydata
    | Drug | Patients |
    |------+----------|
    |    X |      232 |
    |    Y |      351 |
    |    Z |      117 |

Note that there cannot be a blank line between the NAME header and the beginning of the table.

You can bring this table into Python simply by declaring it to be a variable in the header of a Python code block.

    #+begin_src python :var tbl=mydata :results output
    print(tbl)
    #+end_src

When you evaluate this block, you see that the table is imported as a list of lists.

    [['X', 232], ['Y', 351], ['Z', 117]]

Note that the column headings were not imported into Python. Now suppose you would like to retain the headers, and use them as column names in a pandas data frame.

    #+begin_src python :var tbl=mydata :colnames no :results output
    import pandas as pd
    df = pd.DataFrame(tbl[1:], columns=tbl[0])
    print(df, "\n")
    print(df["Patients"].mean())
    #+end_src

When evaluated, this block produces the following.

      Drug  Patients 
    0    X       232
    1    Y       351
    2    Z       117

    233.33333333333334

Note that in order to import the column names, we told org-mode that there are no column names! We did this with the header option

    :colnames no

This seems backward, but it makes sense. It says do bring in the first row of the table, even though it appears to be a column header that isn’t imported by default. But then we tell pandas that we want to make a data frame out of all but the first row (i.e. tbl[1:]) and we want to use the first row (i.e. tbl[0]) as the column names.

A possible disadvantage to keeping data and code together is that the data could be large. But since org files are naturally in outline mode, you could collapse the part of the outline containing the data so that you don’t have to look at it unless you need to.

Related posts

Naming probability functions

Given a random variable X, you often want to compute the probability that X will take on a value less than x or greater than x. Define the functions

FX(x) = Prob(Xx)

and

GX(x) = Prob(X > x)

What do you call F and G? I tend to call them the CDF (cumulative distribution function) and CCDF (complementary cumulative distribution function) but conventions vary.

The names of software functions to compute these two functions can be confusing. For example, Python (SciPy) uses the names cdf and sf (the latter for “survival function”) while the R functions to compute the CDF take an optional argument to return the CCDF instead [1].

In the Emacs calculator, the function ltpn computes the CDF. At first glace I thought this was horribly cryptic. It’s actually a very nice naming convention; it just wasn’t what I was expecting.

The “ltp” stands for lower tail probability and “n” stands for normal. The complementary probability function is utpn where “utp” stands for upper tail probability. Unlike other software libraries, Emacs gives symmetric names to these two symmetrically related functions.

“Lower tail” probability is clearer than “cumulative” probability because it leaves no doubt whether you’re accumulating from the left or the right.

You can replace the “n” at the end of ltpn and utpn with the first letters of binomial, chi-square, t, F, and Poisson to get the corresponding functions for these distributions. For example, utpt gives the upper tail probability for the Student t distribution [2].

The Emacs calculator can be quirky, but props to the developers for choosing good names for the probability functions.

Related posts

[1] Incidentally, the CCDF cannot always be computed by simply computing the CDF first and subtracting the value from 1. In theory this is possible, but not in floating point practice. See the discussion of erf and erfc in this post for an explanation.

[2] These names are very short and only a handful of distribution families are supported. But that’s fine in context. The reason to use the Emacs calculator is to do a quick calculation without having to leave Emacs, not to develop production quality statistical software.

Inline computed content in org-mode

The previous post discussed how to use org-mode as a notebook. You can have blocks of code and blocks of results, analogous to cells in a Jupyter notebook. The code and the results export as obvious blocks when you export the org file to another format, such as LaTeX or HTML. And that’s fine for a notebook.

Now suppose you want to do something more subtle. You want to splice in the result of a computed value without being obvious about it. Maybe you want to compute a value rather than directly enter it so that the document will remain consistent. Maybe you have a template and you want to set the parameters of the template at the top of the file.

Web development languages like PHP do this well. You can write a PHP file that is essentially an HTML file with pieces of code spliced in. You do this my inserting

    <?php … ?>

into the HTML code, and when the page is rendered the code between the <?php and ?> tags is replaced with the result of executing the code. We’d like to do something analogous in org-mode with org-babel. (org-babel is the subsystem of org-mode that interacts with code.)

Here’s an org-mode example that sets length and width as variables at the top of a file and multiplies them later in the body of the file to get area.

We define our variables as follows. The block is marked :exports none because we do not want to display the code or the values. We just want the code to run when we export the file.

    #+begin_src python :session :exports none
    length, width = 7, 13
    #+end_src

The following almost does what we want [1].

    Area equals src_python[:session]{length*width}.

This renders as

Area equals 91.

if we export our org file to HTML The number 91 is typeset differently than the words before it. This would be more obvious if the computed value were a string rather than a number.

Org-mode is wrapping <code> tags around the computed result. If we were to export the org file to LaTeX it would wrap the result with \texttt{}. This is because, by default, the output of a computation is displayed as computer output, which is conventionally set in a monospace font like Courier. That’s fine in a technical document when we want to make it obvious that a calculation is a calculation, but typically not in a business context. You wouldn’t want, for example, to generate a letter that starts

Dear Michael,

with Michael’s name set in Courier, announcing that this is a form letter.

The fix is to add :results raw to the header session, the part in square brackets between src_python and the Python code.

    Area equals src_python[:session :results raw]{length*width}.

Now the calculation result is reported “raw”, i.e. without any special markup surrounding it.

***

[1] In this example I’m using Python, and so I used the function src_python. org-babel supports dozens of languages, and each has its src_<language> counterpart.

Org-mode as a lightweight notebook

You can think of org-mode as simply a kind of markdown, a plain text file that can be exported to fancier formats such as HTML or PDF. It’s a lot more than that, but that’s a reasonable place to start.

Org-mode also integrates with source code. You can embed code in your file and have the code and/or the result of running the code appear when you export the file to another format.

Org-mode as notebook

You can use org-mode as a notebook, something like a Jupyter notebook, but much simpler. An org file is a plain text file, and you can execute embedded code right there in your editor. You don’t need a browser, and there’s no hidden state.

Here’s an example of mixing markup and code:

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n.$$

    #+begin_src python :session
    from scipy import pi
    from scipy.special import gamma

    def vol(r, n):
        return pi**(n/2)*r**n/gamma(n/2 + 1)

    vol(1, 5)
    #+end_src

If you were to export the file to PDF, the equation for the volume of a sphere would be compiled into a image using LaTeX.

To run the code [1], put your cursor somewhere in the code block and type C-c C-c. When you do, the following lines will appear below your code.

    #+RESULTS:
    : 5.263789013914324

If you think of your org-mode file as primary, and you’re just inserting some code as a kind of scratch area, an advantage of org-mode is that you never leave your editor.

Jupyter notebooks

Now let’s compare that to a Jupyter notebook. Jupyter organizes everything by cells, and a cell can contain markup or code. So you could create a markup cell and enter the exact same introductory text [2].

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n$$.

When you “run” the cell, the LaTeX is processed and you see the typeset expression rather than its LaTeX source. You can click on the cell to see the LaTeX code again.

Then you would enter the Python code in another cell. When you run the cell you see the result, much as in org-mode. And you could export your notebook to PDF as with org-mode.

File diff

Now suppose we make a couple small changes. We want the n and r in the comment section set in math italic, and we’d like to find the volume of a 5-sphere of radius 2 rather than radius 1. We do this, in Jupyter and in org-mode [3], by putting dollar signs around the “n” and the “r”, and we change vol(1, 5) to vol(2, 5).

Let’s run diff on the two versions of the org-mode file and on the two versions of the Jupyter notebook.

The differences in the org files are easy to spot:

    1c1
    < The volume of an n-sphere of radius r is 
    ---
    > The volume of an \(n\)-sphere of radius \(r\) is 
    12c12
    < vol(1, 5)
    ---
    > vol(2, 5)
    16c16,17
    < : 5.263789013914324
    ---
    > : 168.44124844525837

However, the differences in the Jupyter files are more complicated:

    5c5
    <    "id": "2a1b0bc4",
    ---
    >    "id": "a0a89fcf",
    8c8
    <     "The volume of an n-sphere of radius r is \n",
    ---
    >     "The volume of an $n$-sphere of radius $r$ is \n",
    15,16c15,16
    <    "execution_count": 1,
    <    "id": "888660a2",
    ---
    >    "execution_count": 2,
    >    "id": "1adcd8b1",
    22c22
    <        "5.263789013914324"
    ---
    >        "168.44124844525837"
    25c25
    <      "execution_count": 1,
    ---
    >      "execution_count": 2,
    37c37
    <     "vol(1, 5)"
    ---
    >     "vol(2, 5)"
    43c43
    <    "id": "f8d4d1b0",

There’s a lot of extra stuff in a Jupyter notebook. This is a trivial notebook, and more complex notebooks have more extra stuff. An apparently small change to the notebook can cause a large change in the underlying notebook file. This makes it difficult to track changes in a Jupyter notebook in a version control system.

Related posts

[1] Before this will work, you have to tell Emacs that Python is one of the languages you want to run inside org-mode. I have the following line in my init file to tell Emacs that I want to be able to run Python, DITAA, R, and Perl.

    (org-babel-do-load-languages 'org-babel-load-languages '((python . t) (ditaa . t) (R . t) (perl . t)))

[2] Org-mode will let you use \[ and \] to bracket LaTeX code for a displayed equation, and it will also let you use $$. Jupyter only supports the latter.

[3] In org-mode, putting dollar signs around variables sometimes works and sometimes doesn’t. And in this example, it works for the “r” but not for the “n”. This is very annoying, but it can be fixed by using \( and \) to enter and leave math mode rather than use a dollar sign for both.

Deleting reproducible files in Emacs dired

Imagine you could list the contents of a directory from a command line, and then edit the text output to make things happen. That’s sorta how Emacs dired works. It’s kind of a cross between a bash shell and the Windows File Explorer. Why would you ever want to use such a bizarre hybrid?

One reason is to avoid context switching. If you’re editing a file, you can pop over to a new buffer that is your file manager, do what you need to do, then pop back, all without ever leaving Emacs.

Another reason is that, as with everything else in Emacs, it’s all text. Everything in Emacs is just text, and so the same editing commands can be used everywhere. (More on that here.)

Even though I use Emacs daily, and even though I can make a case for why dired is great, I don’t use it that much. Or rather, I don’t use that much of what it can do. The good that I would, I do not.

I was reviewing dired‘s features and discovered something very handy: typing %& will mark files for deletion that can easily be created again. In particular, it will flag the byproducts of compiling a LaTeX file.

For example, the following is a screenshot of a dired buffer.

When I type %& it highlights the LaTeX temp files in red and marks them for deletion. (The D’s in the left column indicate files to be deleted.)

This doesn’t delete the files, but it marks them for deletion. If I then type x the files will be deleted.

In addition to the unneeded LaTeX files, it also highlighted a .bak file. However, it did not highlight the .o object file. I suppose the thought was that most people would manage C programs from a make file. I’m sure the class of files to mark is configurable, like everything else in Emacs.

More Emacs posts

Opening Windows files from bash and eshell

I often work in a sort of amphibious environment, using Unix software on Windows. As you can well imagine, this causes headaches. But I’ve found such headaches are generally more manageable than the headaches from alternatives I’ve tried.

On the Windows command line, you can type the name of a file and Windows will open the file with the default application associated with its file extension. For example, typing foo.docx and pressing Enter will open the file by that name using Microsoft Word, assuming that is your default application for .docx files.

Unix shells don’t work that way. The first thing you type at the command prompt must be a command, and foo.docx is not a command. The Windows command line generally works this way too, but it makes an exception for files with recognized extensions; the command is inferred from the extension and the file name is an argument to that command.

WSL bash

When you’re running bash on Windows, via WSL (Windows Subsystem for Linux), you can run the Windows utility start which will open a file according to its extension. For example,

    cmd.exe /C start foo.pdf

will open the file foo.pdf with your default PDF viewer.

You can also use start to launch applications without opening a particular file. For example, you could launch Word from bash with

    cmd.exe /C start winword.exe

Emacs eshell

Eshell is a shell written in Emacs Lisp. If you’re running Windows and you do not have access to WSL but you do have Emacs, you can run eshell inside Emacs for a Unix-like environment.

If you try running

    start foo.pdf

that will probably not work because eshell does not use the windows PATH environment.

I got around this by creating a Windows batch file named mystart.bat and put it in my path. The batch file simply calls start with its argument:

    start %

Now I can open foo.pdf from eshell with

    mystart foo.pdf

The solution above for bash

    cmd.exe /C start foo.pdf

also works from eshell.

(I just realized I said two contradictory things: that eshell does not use your path, and that it found a bash file in my path. I don’t know why the latter works. I keep my batch files in c:/bin, which is a Unix-like location, and maybe eshell looks there, not because it’s in my Windows path, but because it’s in what it would expect to be my path based on Unix conventions. I’ve searched the eshell documentation, and I don’t see how to tell what it uses for a path.)

More shell posts

Org entities

This morning I found out that Emacs org-mode has its own markdown entities, analogous to HTML entities or LaTeX commands. Often they’re identical to LaTeX commands. For example, \approx is the approximation symbol ≈, exactly as in LaTeX.

So what’s the advantage of org-entities? In fact, how does Emacs even know whether \approx is a LaTeX command or an org entity?

If you use the command C-c C-x \ , Emacs will show you the compiled version of the entity, i.e. ≈ rather than the command \approx. This is global: all entities are displayed. The org entities would be converted to symbols if you export the file to HTML or LaTeX, but this gives you a way to see the symbols before exporting.

Here something that’s possibly surprising, possibly useful. The symbol you see is for display only. If you copy and paste it to another program, you’ll see the entity text, not the symbol. And if you C-c C-x \ again, you’ll see the command again, not the symbol; Note that the full name of the command is org-toggle-pretty-entities with “toggle” the middle.

If you use set-input-method to enter symbols using LaTeX commands or HTML entities as I described here, Emacs inserts a Unicode character and is irreversible. Once you type the LaTeX command \approx or the corresponding HTML entity &asymp;, any knowledge of how that character was entered is lost. So org entities are useful when you want to see Unicode characters but want your source file to remain strictly ASCII.

Incidentally, there are org entities for Hebrew letters, but only the first four, presumably because these are the only ones used as mathematical symbols.

To see a list of org entities, use the command org-entities-help. Even if you never use org entities, the org entity documentation makes a nice reference for LaTeX commands and HTML entities. Here’s a screenshot of the first few lines.

First few lines of org-entities-help

Related posts

Entering symbols in Emacs

Emacs has a relatively convenient way to add accents to letters or to insert a Unicode character if you know the code point for the value. See these notes.

But usually you don’t know the Unicode values of symbols. Then what do you do?

TeX commands

You enter symbols by typing their corresponding TeX commands by using

    M-x set-input-method RET tex

After doing that, you could, for example, enter π by typing \pi.

You’ll see the backslash as you type the command, but once you finish you’ll see the symbol instead [1].

HTML entities

You may know the HTML entity for a symbol and want to use that to enter characters in Emacs. Unfortunately, the following does NOT work.

    M-x set-input-method RET html

However, there is a slight variation on this that DOES work:

    M-x set-input-method RET sgml

Once you’ve set your input method to sgml, you could, for example, type &radic; to insert a √ symbol.

Why SGML rather than HTML?

HTML was created by simplifying SGML (Standard Generalized Markup Language). Emacs is older than HTML, and so maybe Emacs supported SGML before HTML was written.

There may be some useful SGML entities that are not in HTML, though I don’t know. I imagine these days hardly anyone knows anything about SGML beyond the subset that lives on in HTML and XML.

Changing input modes

If you want to move between your default input mode and TeX mode, you can use the command toggle-input-method. This is usually mapped to C-u C-\.

You can see a list of all available input methods with list-input-methods. Most of these are spoken languages, such as Arabic or Welsh, rather than technical input modes like TeX and SGML.

More Emacs posts

[1] I suppose there could be a problem if one command were a prefix of another. That is, if there were symbols \foo and \foobar and you intended to insert the latter, Emacs might think you’re done after you’ve typed the former. But I can’t think of a case where that would happen. TeX commands are nearly prefix codes. There are TeX commands like \tan and \tanh, but these don’t represent symbols per se. Emacs doesn’t need any help to insert the letters “tan” or “tanh” into a file.

Control characters

I didn’t realize until recently that there’s a connection between the control key on a computer keyboard and controlling a mechanical device. Both uses of the word control are related via ASCII control characters as I discovered by reading the blog post Four Column ASCII.

Computers work with bits in groups of eight, and there are a lot more possible eight-bit combinations than there are letters in the Roman alphabet, so some of the values were reserved for printer control codes. This is most obvious when you arrange the table of ASCII values in four columns, hence the title of the post above.

Most of the codes for controlling printers are obsolete, but historical vestiges remain. When you hold down the control key and type a letter, it may produce a corresponding control character which differs from the letter by flipping its second bit from 1 to 0, though often the control keys have been put to other uses.

Control-H

The letter H has ASCII code 0100 1000 and the back space control character has ASCII code 0000 1000. In some software, such as the bash shell and the Windows command line cmd, holding down the control key and typing H has the same effect as using the backspace key [1].

Other software uses Control-H for its own purposes. For example, Windows software often uses it to bring up a find-and-replace dialog, and Emacs uses it as the prefix to a help command.

Control-I

In ASCII the letter I is 0100 1001 and the tab character is 0000 1001. In some software you can produce a tab character with Control-I. This works in Emacs and in Notepad, for example. It doesn’t work in WYSIWYG programs like Word where Control-I usually formats text in italic.

Control-J and Control-M

The letter J has ASCII code 0100 1010 and the line feed control character has ASCII code 0000 1010. In some software typing Control-J inserts a line feed, and in other software it does something analogous.

Unix uses a line feed character to denote the start of a new line, but DOS used a carriage return and a line feed. If you type Control-J in Windows Notepad, you’ll get a new line, but it will be saved as a carriage return and a line feed.

In Emacs, the behavior of Control-J depends on the mode. In text mode, it simply inserts a newline. In TeX mode, Control-J ends a paragraph, but it also checks the preceding paragraph for unbalanced delimiters. If you have something like an open brace with no corresponding close brace, you’ll see a warning “Paragraph being closed appears to contain a mismatch.”

The carriage return character has ASCII code 0000 1101, and M has ASCII code 0100 1101. That why if a file was create on Windows and you open it in Unix, you may see ^M throughout the file.

Control-[

Some control characters correspond to characters other than letters. If you flip the second bit of the ASCII code for [ you get the control character for escape. And in some software, such as vi or Emacs, Control-[ has the same effect as the escape key.

More ASCII posts

[1] Control keys are often written with capital letters, like Control-H. This can be misleading if you think this means you have to also hold down the shift key as if you were typing a capital H. Control-h would be better notation. But the ASCII codes for control characters correspond to capital letters, so I use capital letters here.

Journalistic stunt with Emacs

Emacs has been called a text editor with ambitions of being an operating system, and some people semi-seriously refer to it as their operating system. Emacs does not want to be an operating system per se, but it is certainly ambitious. It can be a shell, a web browser, an email client, a calculator, a Lisp interpreter, etc. It’s possible to work all day and never leave Emacs. It would be an interesting experiment to do just that.

Journalist experiment

Journalists occasionally impose some restriction on themselves and write about the experience. For example, Kashmir Hill did an experiment earlier this year, blocking the Big Five tech companies—Amazon, Facebook, Google, Microsoft, and Apple—for a week each, then finally all in the same week, and wrote a series about her experience. It would be interesting for someone to work only from Emacs for a week and write about it.

Living exclusively inside Emacs would be hard. Emacs applications require effort to discover and learn how to use, and different people find different applications worth learning. Someone doing everything in Emacs for the sake of a story would have to use some features they would not otherwise find worthwhile.

Why stay inside Emacs?

The point of using a calculator, for example, inside Emacs is that it lets you stay in your primary work environment. You don’t have to open a new application to do a quick calculation. Also, since everything is text-based, everything can be navigated and edited the same way.

You may have run into a situation using Windows where some text can be copied, such as text inside an edit box, but other text cannot, such as text displayed on a dialog box. That doesn’t happen in Emacs since everything is editable text. Consistency and interoperability sometimes make it worthwhile to do things inside Emacs that could be done more easily in another application.

Finally, everything in Emacs is programmable. Something that is awkward to use manually might still be valuable since it can be automated.

Examples from recent posts

My previous post was about various ways to compute hash functions. I could have added Emacs to the list. Here’s how you could compute the SHA256 hash of “hello world” using Emacs Lisp:

    (secure-hash 'sha256 "hello world")

You could, for example, type the code above in the middle of a document and type Control-x Control-e to evaluate it as a Lisp expression.

I also wrote about software to factor integers recently, and you could do this in Emacs as well. You could pull up the Emacs calculator and type prfac(161393) for example and it would return a list of the prime factors: [251, 643].

Neither of these functions is best of breed. The secure-hash function only supports the most popular hash functions, unlike openssl. And prfac will work fail on large inputs, unlike PARI/GP. Emacs is ambitions, but not that ambitious. It doesn’t aim to replace specialized software, but to provide a convenient way to carry out common tasks.