Dump a pickle file to a readable text file

I got a data file from a client recently in “pickle” format. I happen to know that pickle is a binary format for serializing Python objects, but trying to open a pickle file could be a puzzle if you didn’t know this.

Be careful

There are a couple problems with using pickle files for data transfer. First of all, it’s a security risk because an attacker could create a malformed pickle file that would cause your system to run arbitrary code. In the Python Cookbook, the authors David Beazley and Brian K. Jones warn

It’s essential that pickle only be used internally with interpreters that have some ability to authenticate one another.

The second problem is that the format could change. Again quoting the Cookbook,

Because of its Python-specific nature and attachment to source code, you probably shouldn’t use pickle as a format for long-term storage. For example, if the source code changes, all of your stored data might break and become unreadable.

Suppose someone gives you a pickle file and you’re willing to take your chances and open it. It’s from a trusted source, and it was created recently enough that the format probably hasn’t changed. How do you open it?

Unpickling

The following code will open the file data.pickle and read it into an object obj.

    import pickle
    obj = pickle.load(open("data.pickle", "rb"))

If the object in the pickle file is very  small, you could simply print obj. But if the object is at all large, you probably want to save it to a file rather than dumping it at the command line, and you also want to “pretty” print it than simply printing it.

Pretty printing

The following code will dump a nicely-formatted version of our pickled object to a text file out.txt.

    import pickle
    import pprint

    obj = pickle.load(open("sample_data.pickle", "rb"))

    with open("out.txt", "a") as f:
         pprint.pprint(obj, stream=f)

In my case, the client’s file contained a dictionary of lists of dictionaries. It printed as one incomprehensible line, but it pretty printed as 40,000 readable lines.

Prettier printing

Simon Brunning left a comment suggesting that the json module output is even easier to read.

    import json

    with open("out.txt", "a") as f:
        json.dump(obj, f, indent=2)

And he’s right, at least in my case. The indentation json.dump produces is more what I’d expect, more like what you’d see if you were writing the structure in well-formatted source code.

Related posts

Org-mode as a lightweight notebook

You can think of org-mode as simply a kind of markdown, a plain text file that can be exported to fancier formats such as HTML or PDF. It’s a lot more than that, but that’s a reasonable place to start.

Org-mode also integrates with source code. You can embed code in your file and have the code and/or the result of running the code appear when you export the file to another format.

Org-mode as notebook

You can use org-mode as a notebook, something like a Jupyter notebook, but much simpler. An org file is a plain text file, and you can execute embedded code right there in your editor. You don’t need a browser, and there’s no hidden state.

Here’s an example of mixing markup and code:

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n.$$

    #+begin_src python :session
    from scipy import pi
    from scipy.special import gamma

    def vol(r, n):
        return pi**(n/2)*r**n/gamma(n/2 + 1)

    vol(1, 5)
    #+end_src

If you were to export the file to PDF, the equation for the volume of a sphere would be compiled into a image using LaTeX.

To run the code [1], put your cursor somewhere in the code block and type C-c C-c. When you do, the following lines will appear below your code.

    #+RESULTS:
    : 5.263789013914324

If you think of your org-mode file as primary, and you’re just inserting some code as a kind of scratch area, an advantage of org-mode is that you never leave your editor.

Jupyter notebooks

Now let’s compare that to a Jupyter notebook. Jupyter organizes everything by cells, and a cell can contain markup or code. So you could create a markup cell and enter the exact same introductory text [2].

    The volume of an n-sphere of radius r is 

    $$\frac{\pi^{\frac{n}{2}}}{\Gamma\left(\frac{n}{2} + 1\right)}r^n$$.

When you “run” the cell, the LaTeX is processed and you see the typeset expression rather than its LaTeX source. You can click on the cell to see the LaTeX code again.

Then you would enter the Python code in another cell. When you run the cell you see the result, much as in org-mode. And you could export your notebook to PDF as with org-mode.

File diff

Now suppose we make a couple small changes. We want the n and r in the comment section set in math italic, and we’d like to find the volume of a 5-sphere of radius 2 rather than radius 1. We do this, in Jupyter and in org-mode [3], by putting dollar signs around the “n” and the “r”, and we change vol(1, 5) to vol(2, 5).

Let’s run diff on the two versions of the org-mode file and on the two versions of the Jupyter notebook.

The differences in the org files are easy to spot:

    1c1
    < The volume of an n-sphere of radius r is 
    ---
    > The volume of an \(n\)-sphere of radius \(r\) is 
    12c12
    < vol(1, 5)
    ---
    > vol(2, 5)
    16c16,17
    < : 5.263789013914324
    ---
    > : 168.44124844525837

However, the differences in the Jupyter files are more complicated:

    5c5
    <    "id": "2a1b0bc4",
    ---
    >    "id": "a0a89fcf",
    8c8
    <     "The volume of an n-sphere of radius r is \n",
    ---
    >     "The volume of an $n$-sphere of radius $r$ is \n",
    15,16c15,16
    <    "execution_count": 1,
    <    "id": "888660a2",
    ---
    >    "execution_count": 2,
    >    "id": "1adcd8b1",
    22c22
    <        "5.263789013914324"
    ---
    >        "168.44124844525837"
    25c25
    <      "execution_count": 1,
    ---
    >      "execution_count": 2,
    37c37
    <     "vol(1, 5)"
    ---
    >     "vol(2, 5)"
    43c43
    <    "id": "f8d4d1b0",

There’s a lot of extra stuff in a Jupyter notebook. This is a trivial notebook, and more complex notebooks have more extra stuff. An apparently small change to the notebook can cause a large change in the underlying notebook file. This makes it difficult to track changes in a Jupyter notebook in a version control system.

Related posts

[1] Before this will work, you have to tell Emacs that Python is one of the languages you want to run inside org-mode. I have the following line in my init file to tell Emacs that I want to be able to run Python, DITAA, R, and Perl.

    (org-babel-do-load-languages 'org-babel-load-languages '((python . t) (ditaa . t) (R . t) (perl . t)))

[2] Org-mode will let you use \[ and \] to bracket LaTeX code for a displayed equation, and it will also let you use $$. Jupyter only supports the latter.

[3] In org-mode, putting dollar signs around variables sometimes works and sometimes doesn’t. And in this example, it works for the “r” but not for the “n”. This is very annoying, but it can be fixed by using \( and \) to enter and leave math mode rather than use a dollar sign for both.

Find log normal parameters for given mean and variance

Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I needed in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and keep going with my train of thought.

NB: The parameters μ and σ² of a log normal distribution are not the mean and variance of the distribution; they are the mean and variance of its log.

If m is the mean and v is the variance then

\begin{align*} m &= \exp(\mu + \sigma^2/2) \\ v &= (\exp(\sigma^2) - 1) \exp(2\mu + \sigma^2) \end{align*}

Notice that the square of the m term matches the second part of the v term.

Then

\frac{v}{m^2} = \exp(\sigma^2) -1

and so

\sigma^2 = \log\left(\frac{v}{m^2} + 1 \right)

and once you have σ² you can find μ by

\mu = \log m - \sigma^2/2

Here’s Python code to implement the above.

    from numpy immport log
    def solve_for_log_normal_parameters(mean, variance):
        sigma2 = log(variance/mean**2 + 1)
        mu = log(mean) - sigma2/2
        return (mu, sigma2)

And here’s a little test code for the code above.

    mean = 3.4
    variance = 5.6

    mu, sigma2 = solve_for_log_normal_parameters(mean, variance)

    X = lognorm(scale=exp(mu), s=sigma2**0.5)
    assert(abs(mean - X.mean()) < 1e-10)
    assert(abs(variance - X.var()) < 1e-10)

Related posts

Spaceship operator in Python

Some programming languages, such as Perl, have an infix operator <=> that returns a three-state comparison. The expression

    a <=> b

evaluates to -1, 0, or 1 depending on whether a < b, a = b, or a > b. You could think of <=> as a concatenation of <, =, and >.

The <=> operator is often called the “spaceship operator” because it looks like Darth Vader’s ship in Star Wars.

Python doesn’t have a spaceship operator, but you can get the same effect with numpy.sign(a-b). For example, suppose you wanted to write a program to compare two integers.

You could write

    from numpy import sign
    def compare(x, y):
        cmp = ["equal to", "greater than", "less than"][sign(x-y)]
        print(f"{x} is {cmp} {y}.")

Here we take advantage of the fact that an index of -1 points to the last element of a list.

The sign function will return an integer if its argument is an integer or a float if its argument is a float. The code above will break if you pass in floating point numbers because sign will return -1.0, 0.0, or 1.0. But if you replace sign(x-y) with int(sign(x-y)) it will work for floating point arguments.

Related post: Symbol pronunciation

Illustrating Gershgorin disks with NumPy

Gershgorin’s theorem gives bounds on the locations of eigenvalues for an arbitrary square complex matrix.

The eigenvalues are contained in disks, known as Gershgorin disks, centered on the diagonal elements of the matrix. The radius of the disk centered on the kth diagonal element is the sum of the absolute values of the elements in the kth row, excluding the diagonal element itself.

To illustrate this theorem, we create a 5 by 5 diagonal matrix and add some random noise to it. The diagonal elements are

0, 3 + i, 4 + i, 1 + 5i, 9 + 2i.

The eigenvalues of the diagonal matrix would simply be the diagonal elements. The additional random values pull the eigenvalues away from the diagonal values, but by an amount no more than Gershgorin predicts.

Gershgorin disks and eigenvalues

Note that in this example, two of the eigenvalues land in the smallest disk. It’s possible that a disk may not contain any eigenvalues; what Gershgorin guarantees is that the union of all the disks contains the union of all the eigenvalues.

Here’s the Python code that created the graph above.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(202103014)

N = 5 # dimension of our square matrix

D = np.diag([0, 3 + 1j, 4 + 1j, 1 + 5j, 9 + 2j])
M = np.random.rand(N, N) + D

R = np.zeros(N) # disk radii
for i in range(N):
    R[i] = sum(abs(M[i,:])) - abs(M[i,i])

eigenvalues = np.linalg.eigvals(M)

# Plotting code
fig, ax = plt.subplots()
for k in range(N):
    x, y = M[k,k].real, M[k,k].imag
    ax.add_artist( plt.Circle((x, y), R[k], alpha=0.5) )
    plt.plot(eigenvalues[k].real, eigenvalues[k].imag, 'k+')

ax.axis([-4, 12.5, -4, 9])
ax.set_aspect(1)    
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Gershgorin disks and eigenvalues $x + iy$")

Pareto and Pandas

This post muses about what it means to learn a software library. I’ll use Pandas as an example, but the post isn’t just about Pandas.

Suppose you say “I want to learn Pandas.” That implicitly assumes Pandas one thing, and in a sense it is. In another sense Pandas is hundreds of things.

At the top level, the pandas module (version 1.2.0) has 142 things inside.

    >>> import pandas as pd
    >>> len(dir(pd))
    142

The two most important things inside are the Series and DataFrame objects. They each in turn contain hundreds of things.

    >>> len(dir(pd.Series))
    434
    >>> len(dir(pd.DataFrame))
    441

That’s evidence Pandas’ diversity. But here’s evidence of it’s unity: most of the things inside these two objects have the same names.

    >>> s = set(dir(pd.Series))
    >>> d = set(dir(pd.DataFrame))
    >>> len(s.union(d))
    491
    >>> len(s - d)
    50
    >>> len(d - s)
    57

Pandas kinda has a fractal dimension, having both complexity and unity. The best way to think about it is not as one monolithic thing, or as hundreds of isolated things. It’s a coherent, but not perfectly coherent, collection of related things. This is true of all software libraries. Pandas is more coherent than most libraries because it was initially the product of one mind, that of Wes McKinney.

This has a couple implications for what it means to “learn Pandas.” Because Pandas is big, you have to explore it strategically, not exhaustively. And because Pandas is coherent, part of what it means to learn Pandas is to develop a feel for the way Pandas does things.

No one is going to learn Pandas by studying every object, every method on every object, and every argument to every method on every object. It’s too big. That’s also unnecessary.

There’s probably something like a Pareto distribution on the usefulness of features. The most commonly used features are used far, far more often than the most obscure features.

It would be interesting to do some kind of survey to see which features are actually used and how often. But I don’t think that’s practical. The easiest thing to do would be to find some large code base that heavily uses Pandas. But that’s not typical of how Pandas is used. Probably most lines of code using Pandas are scattered over millions of small scripts, much of it not in production code.

A well-designed library makes it possible to make good guesses about functionality you haven’t used. You learn the gestalt of the library. You can always look up API documentation as needed, but you can’t develop an intuition for a library just-in-time.

“Learn Pandas” is a daunting goal, and maybe an impossible goal if by “learn” you mean explore exhaustively. But “learn how to do my common tasks quickly in Pandas” and “develop a feel for how to do things in Pandas” are much smaller tasks.

Related posts

Broadcasting and functors

In my previous post, I looked at the map Δ that takes a column vector to a diagonal matrix. I even drew a commutative diagram, which foreshadows a little category theory.

Suppose you have a function f of a real or complex variable. To an R programmer, if x is a vector, it’s obvious that f(x) means to apply f to every component of a vector. Python (NumPy) works the same way, and calls this broadcasting. To a mathematician, this looks odd. What does the logarithm of a vector, for example, even mean?

As in the previous post, we can use Δ to formalize things. We said that Δ has some nice properties, and in fact we will show it is a functor.

To have a functor, we have to have categories. (Historically, functors came first; categories were defined in order to define functors.) We will define C to be the category of column vectors and M the category of square matrices as before. Or rather, we should say the objects of C are column vectors and the objects of M are square matrices.

Categories need morphisms, functions between objects [1]. We define the morphisms on C to be analytic functions applied componentwise. So, for example, if

z = [1, 2, -3],

then

tan(z) = [tan(1), tan(2), tan(-3)].

The morphisms on M will be analytic functions on square matrices, not applied componentwise but applied by power series. That is, given an analytic function f, we define f of a square matrix X as the result of sticking the matrix X into the power series for f. For an example, see What is the cosine of a matrix?

We said that Δ is a functor. It takes column vectors and turns them into square matrices by putting their contents along the diagonal of a matrix. We gave the example in the previous post that [4, i, π] would be mapped to the matrix with these elements on the diagonal, i.e.

\Delta: \begin{pmatrix} 4 \\ i \\ \pi \end{pmatrix} \mapsto \begin{pmatrix} 4 & 0 & 0\\ 0 & i & 0 \\ 0 & 0 & \pi \end{pmatrix}

That says what Δ does on objects, but what does it do on morphisms? It takes an analytic function that was applied componentwise to column vectors, and turns it into a function that is applied via its power series to square matrices. That is, starting with a function

f(z) = \sum_{n=0}^\infty a_nz^n

we define the morphism f on C by

f : \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{pmatrix} \mapsto \begin{pmatrix} f(z_1) \\ f(z_2) \\ \vdots \\ f(z_n) \end{pmatrix}

and the morphism Δ f on M by

\Delta f : Z \mapsto \sum_{n=0}^\infty a_n Z^n

where Z is a square matrix.

We can apply f to a column vector, and then apply Δ to turn the resulting vector into a diagonal matrix, or we could apply Δ to turn the vector into a diagonal matrix first, and then apply f (technically,  Δf). That is, the follow diagram commutes:

\begin{diagram}   C & \rTo^{\Delta} & M \\   \uTo^{f} & & \uTo_{\Delta f} \\   C & \rTo^{\Delta} & M  \end{diagram}

Python example

Applying an analytic function to a diagonal matrix gives the same result as simply applying the function to the elements of the diagonal. But for more general square matrices, this is not the case. We will illustrate this with some Python code.

    import numpy as np
    from scipy.linalg import funm

    d = np.array([1, 2])
    D = np.diag(d)
    M = np.array([[1, np.pi], [2, 0]])

Now let’s look at some output.

    >>> np.sin(d)                     
    array([0.84147098, 0.90929743])   

    >>> np.sin(D)                     
    array([[0.84147098, 0.        ],  
           [0.        , 0.90929743]]) 

    >>> funm(D, np.sin)               
    array([[0.84147098, 0.        ],  
           [0.        , 0.90929743]]) 

So if we take the sine of d and turn the result into a matrix, we get the same thing as if we turn d into a matrix D and then take the sine of D, either componentwise or as an analytic function (with funm, function of a matrix).

Now let’s look at a general, non-diagonal matrix.

    >>> np.sin(M)
    array([[0.84147099, 0],
       [0.90929743, 0]])

    >>> funm(D, np.sin)
    array([[0.84147098, 0.        ],
           [0.        , 0.90929743]])

Note that the elements in the bottom row are in opposite positions in the two examples.

[1] OK, morphisms are not necessarily functions, but in practice they usually are.

Python triple quote strings and regular expressions

There are several ways to quote strings in Python. Triple quotes let strings span multiple lines. Line breaks in your source file become line break characters in your string. A triple-quoted string in Python acts something like “here doc” in other languages.

However, Python’s indentation rules complicate matters because the indentation becomes part of the quoted string. For example, suppose you have the following code outside of a function.

x = """\
abc
def
ghi
"""

Then you move this into a function foo and change its name to y.

def foo():
    y = """\
    abc
    def
    ghi
    """

Now x and y are different strings! The former begins with a and the latter begins with four spaces. (The backslash after the opening triple quote prevents the following newline from being part of the quoted string. Otherwise x and y would begin with a newline.) The string y also has four spaces in front of def and four spaces in front of ghi. You can’t push the string contents to the left margin because that would violate Python’s formatting rules. (Update: Oh yes you can! See Aaron Meurer’s comment below.)

We now give three solutions to this problem.

Solution 1: textwrap.dedent

There is a function in the Python standard library that will strip the unwanted space out of the string y.

import textwrap 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = textwrap.dedent(y)

This works, but in my opinion a better approach is to use regular expressions [1].

Solution 2: Regular expression with a flag

We want to remove white space, and the regular expression for a white space character is \s. We want to remove one or more white spaces so we add a + on the end. But in general we don’t want to remove all white space, just white space at the beginning of a line, so we stick ^ on the front to say we want to match white space at the beginning of a line.

import re 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = re.sub("^\s+", "", y)

Unfortunately this doesn’t work. By default ^ only matches the beginning of a string, not the beginning of a line. So it will only remove the white space in front of the first line; there will still be white space in front of the following lines.

One solution is to add the flag re.MULTILINE to the substitution function. This will signal that we want ^ to match the beginning of every line in our multi-line string.

    y = re.sub("^\s+", "", y, re.MULTILINE)

Unfortunately that doesn’t quite work either! The fourth positional argument to re.sub is a count of how many substitutions to make. It defaults to 0, which actually means infinity, i.e. replace all occurrences. You could set count to 1 to replace only the first occurrence, for example. If we’re not going to specify count we have to set flags by name rather than by position, i.e. the line above should be

    y = re.sub("^\s+", "", y, flags=re.MULTILINE)

That works.

You could also abbreviate re.MULTILINE to re.M. The former is more explicit and the latter is more compact. To each his own. There’s more than one way to do it. [2]

Solution 3: Regular expression with a modifier

In my opinion, it is better to modify the regular expression itself than to pass in a flag. The modifier (?m) specifies that in the rest of the regular the ^ character should match the beginning of each line.

    y = re.sub("(?m)^\s+", "", y)

One reason I believe this is better is that moves information from a language-specific implementation of regular expressions into a regular expression syntax that is supported in many programming languages.

For example, the regular expression

    (?m)^\s+

would have the same meaning in Perl and Python. The two languages have the same way of expressing modifiers [3], but different ways of expressing flags. In Perl you paste an m on the end of a match operator to accomplish what Python does with setting flasgs=re.MULTILINE.

One of the most commonly used modifiers is (?i) to indicate that a regular expression should match in a case-insensitive manner. Perl and Python (and other languages) accept (?i) in a regular expression, but each language has its own way of adding modifiers. Perl adds an i after the match operator, and Python uses

    flags=re.IGNORECASE

or

    flags=re.I

as a function argument.

More on regular expressions

[1] Yes, I’ve heard the quip about two problems. It’s funny, but it’s not a universal law.

[2] “There’s more than one way to do it” is a mantra of Perl and contradicts The Zen of Python. I use the line here as a good-natured jab at Python. Despite its stated ideals, Python has more in common with Perl than it would like to admit and continues to adopt ideas from Perl.

[3] Python’s re module doesn’t support every regular expression modifier that Perl supports. I don’t know about Python’s regex module.

Searching Greek and Hebrew with regular expressions

According to the Python Cookbook, “Mixing Unicode and regular expressions is often a good way to make your head explode.” It is thus with fear and trembling that I dip my toe into using Unicode with Greek and Hebrew.

I heard recently that there are anomalies in the Hebrew Bible where the final form of a letter is deliberately used in the middle of a word. That made me think about searching for such anomalies with regular expressions. I’ll come back to that shortly, but I’ll start by looking at Greek where things are a little simpler.

Greek

Only one letter in Greek has a variant form at the end of a word. Sigma is written ς at the end of a word and σ everywhere else. The following Python code shows we can search for sigmas and final sigmas in the first sentence from Matthew’s gospel.

    import re

    matthew = "Κατάλογος του γένους του Ιησού Χριστού, γιου του Δαυείδ, γιου του Αβραάμ"
    matches = re.findall(r"\w+ς\b", matthew)
    print("Words ending in sigma:", matches)

    matches = re.findall(r"\w*σ\w+", matthew)
    print("Words with non-final sigmas:", matches)

This prints

    Words ending in sigma: ['Κατάλογος', 'γένους']
    Words with non-final sigmas: ['Ιησού', 'Χριστού']

This shows that we can use non-ASCII characters in regular expressions, at least if they’re Greek letters, and that the metacharacters \w (word character) and \b (word boundary) work as we would hope.

Hebrew

Now for the motivating example. Isaiah 9:6 contains the word “לםרבה” with the second letter (second from right) being the final form of Mem, ם, sometimes called “closed Mem” because the form of the letter ordinarily used in the middle of a word, מ, is not a closed curve [1].

Here’s code to show we could find this anomaly with regular expressions.

    # There are five Hebrew letters with special final forms.
    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ

    lemarbeh = "\u05dc\u05dd\u05e8\u05d1\u05d4" # לםרבה
    m = re.search(r"\w*[" + finalforms + "\w+", lemarbeh)
    if m:
        print("Anomaly found:", m.group(0))

As far as I know, the instance discussed above is the only one where a final letter appears in the middle of a word. And as far as I know, there are no instances of a non-final form being used at the end of a word. For the next example, we will put non-final forms at the end of words so we can find them.

We’ll need a list of non-final forms. Notice that the Unicode value for each non-final form is 1 greater than the corresponding final form. It’s surprising that the final forms come before the non-final forms in numerical (Unicode) order, but that’s how it is. The same is true in Greek: final sigma is U+03c2 and sigma is U+03c3.

    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ
    nonfinal   = "\u05db\u05de\u05e0\u05e4\u05e6" # כמנפצ

We’ll start by taking the first line of Genesis

    genesis = "בראשית ברא אלהים את השמים ואת הארץ"

and introduce errors by replacing each final form with its corresponding non-final form. The code uses a somewhat obscure feature of Python and is analogous to the shell utility tr.

    genesis_wrong = genesis.translate(str.maketrans(finalforms, nonfinal))

(The string method translate does what you would expect, except that it takes a translation table rather than a pair of strings as its argument. The maketrans method creates a translation table from two strings.)

Now we can find our errors.

    anomalies = re.findall(r"\w+[" + nonfinal + r"]\b", genesis_wrong)
    print("Anomalies:", anomalies)

This produced

    Anomalies: ['אלהימ', 'השמימ', 'הארצ']

Note that Python printed the letters left-to-right.

Vowel points

The text above did not contain vowel points. If the text does contain vowel points, these are treated as letters between the consonants.

For example, the regular expression “בר” will not match against “בְּרֵאשִׁית” because the regex engine sees two characters between ב and ר, the dot (“dagesh”) inside ב and the two dots (“sheva”) underneath ב. But the regular expression “ב..ר” does match against “בְּרֵאשִׁית”.

See footnote [1].

Python

I started this post with a warning from the Python Cookbook that mixing regular expressions and Unicode can cause your head to explode. So far my head intact, but I’ve only tested the waters. I’m sure there are dragons in the deeper end.

I should include the rest of the book’s warning. I used the default library re that ships with Python, but the authors recommend if you’re serious about using regular expressions with Unicode,

… you should consider installing the third-party regex library which provides full support for Unicode case folding, as well as a variety of other interesting features, including approximate matching.

Related posts

[1] When I refer to “בר” as a regular expression, I mean the character sequence ב followed by ר, which your browser will display from right to left because it detects it as characters from a language written from right to left. Written in terms of Unicode code points this would be “\u05d1\u05e8”, the code point for ב followed by the code point for ר.

Similarly, the second regular expression could be written “\u05d1..\u05e8”. The two periods in the middle are just two instances of the regular expression symbol that matches any character. It’s a coincidence that dots (periods) are being used to match dots (the dagesh and the sheva).

Descartes and Toolz

I was looking recently at the Python module toolz, a collection of convenience functions. A lot of these functions don’t do that much. They don’t save you much code, but they do make your code more readable by making it more declarative. You may not realize need them until you see them.

For example, there is a function partitionby that breaks up a sequence at the points where a given function’s value changes. I’m pretty sure that function would have improved some code I’ve written recently, making it more declarative than procedural, but I can’t remember what that was.

Although I can’t think of my previous example, I can think of a new one, and that is Descartes’ rule of signs.

Given a polynomial p(x), read the non-zero coefficients in order and keep note of how many times they change sign, either from positive to negative or vice versa. Call that number n. Then the number of positive roots of p(x) either equals n or n minus a positive even number.

For example, suppose

p(x) = 4x4 + 3.1x3x2 – 2x + 6.

The coefficients are 4, 3.1, -1, -2, and 6. The list of coefficients changes signs twice: from positive to negative, and from negative to positive. Here’s a first pass at how you might have Python split the coefficients to look sign changes.

    from toolz import partitionby

    coefficients = [4, 3.1, -1, -2, 6]
    parts = partitionby(lambda x: x > 0, coefficients)
    print([p for p in parts])

This prints

    [(4, 3.1), (-1, -2), (6,)]

The first argument to partitionby an anonymous function that tests whether its argument is positive. When this function changes value, we have a sign alteration. There are three groups of consecutive coefficients that have the same sign, so there are two times the signs change. So our polynomial either has two positive roots or no positive roots. (It turns out there are no positive roots.)

The code above isn’t quite right though, because Descartes said to only look at non-zero coefficients. If we change our anonymous function to

    lambda x: x >= 0

that will work for zeros in the middle of positive coefficients, but it will give a false positive for zeros in the middle of negative coefficients. We can fix the code with a list comprehension. The following example works correctly.

    coefficients = [4, 0, 3.1, -1, 0, -2, 6]
    nonzero = [c for c in coefficients if c != 0]
    parts = partitionby(lambda x: x > 0, nonzero)
    print([p for p in parts])

If our coefficients were in a NumPy array rather than a list, we could remove the zeros more succinctly.

    from numpy import array

    c = array(coefficients)
    parts = partitionby(lambda x: x > 0, c[c != 0])

The function partitionby returns an iterator rather than a list. That’s why we don’t just print parts above. Instead we print [p for p in parts] which makes a list. In applications, it’s often more efficient to have an iterator than a list, generating items if and when they are needed. If you don’t need all the items, you don’t have to generate them all. And even if you do need all the items, you could save memory by not keeping them all in memory at once. I’ll ignore such efficiencies here.

We don’t need the partitions per se, we just need to know how many there are. The example that escapes my mind would have been a better illustration if it needed to do more with each portion than just count it. We could count the number of sign alternations for Descartes rule as follows.

   len([p for p in parts]) - 1

Related posts