Portable sed -i across MacOS and Linux

The -i flag, which tells sed to edit a file in place, works differently on Linux and MacOS. If you want to create a backup of your file before you edit it, say with the extension .bak, then on Linux you would run

    sed -i.bak myfile

but for the version of sed that ships with MacOS you would write

    sed -i '.bak' myfile

Note that this changes how sed interprets its arguments whether you want a backup file or not: you must specify a backup file extension, but you can specify the extension as '', which effectively means don’t make a backup.

The difference between how the two versions of sed handle their arguments is a minor nuisance when working interactively at the command line, but a script calling sed -i will not work the same on Mac and Linux.

I put the line

   alias sed=gsed

in my .zshrc file so that when I type sed the shell will actually run the Gnu version of sed, which handles -i as I expect.

But this does not work in bash scripts. I tried putting the alias in my .bashrc and .bash_profile files, but in scripts bash ignores aliases, no matter what config file you put them in. Here’s the relevant line from the bash man page:

Aliases are not expanded when the shell is not interactive, unless the expand_aliases shell option is set using shopt.

So the solution is to put the magic incantation

   shopt -s expand_aliases

at the top of your script.

Here’s how I wrote a script using sed -i that works the same way on MacOS and Linux.

    #!/usr/bin/env bash
    shopt -s expand_aliases

    if [[ "$OSTYPE" == "darwin"* ]]; then
        alias sed=gsed
    fi

    sed -i ...

This may seem a little heavy-handed, changing the program that I’m using just to fix a problem with the arguments. I could, for example, have the script insert '' after -i when running on MacOS.

But I’m not changing the program I use. Quite the opposite. The code above saves me from having to use a different program. It ensures that I’m running the Gnu version of sed on both platforms. There are other differences between the two versions of sed. I don’t know what they are offhand, but they could lead to frustrating bugs in the future, and the code above fixes these bugs before they occur.

Elliptic curve addition formulas

The geometric description of addition of points P and Q on an elliptic curve involves four logical branches:

If one of P or Q is the point at infinity …

Else if P = Q

Else if P and Q lie on a vertical line …

Else …

It would seem that an algorithm for adding points would have to have the same structure, which is unfortunate for at least a couple reasons: it adds complexity, and it raises the possibility of a timing attack since some branches will execute faster than others.
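
To make the branching concrete, here is a minimal Python sketch of addition implemented straight from the geometric definition, for a curve y² = x³ + ax + b over the integers mod an odd prime p. The vertical-line test is checked before the doubling test so that doubling a point with y = 0 falls through to the point at infinity. The toy curve in the check at the end is only for illustration.

    # Affine point addition implemented directly from the geometric definition.
    # Points are (x, y) tuples; None represents the point at infinity.
    def ec_add(P, Q, a, p):
        if P is None:                        # P is the point at infinity
            return Q
        if Q is None:                        # Q is the point at infinity
            return P
        x1, y1 = P
        x2, y2 = Q
        if x1 == x2 and (y1 + y2) % p == 0:  # P and Q lie on a vertical line
            return None
        if P == Q:                           # tangent line (doubling)
            s = (3*x1*x1 + a) * pow(2*y1, -1, p) % p
        else:                                # secant line (general case)
            s = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
        x3 = (s*s - x1 - x2) % p
        y3 = (s*(x1 - x3) - y1) % p
        return (x3, y3)

    # Toy check on y^2 = x^3 + 7 over GF(17): doubling (1, 5) gives (2, 10).
    assert ec_add((1, 5), (1, 5), 0, 17) == (2, 10)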

However, an algebraic formula for addition need not mirror its geometric motivation.

Jacobian coordinates

Jacobian coordinates are a different way to represent points on an elliptic curve. They have the advantage that addition formulas have fewer logic branches than the geometric description. They may include one test for whether a point is the point at infinity, but they don’t have multiple branches.

Jacobian coordinates also eliminate the need to do division. The geometric description of addition on an elliptic curve involves calculating where the line through two points intersects the curve, and this naturally involves calculating slopes. But in Jacobian coordinates, addition is done using only addition and multiplication, no division. When you’re working over the field of integers modulo some gigantic prime, multiplication and addition are much faster than division.
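
As a rough sketch of what division-free addition looks like, here are the textbook Jacobian-coordinate formulas for the general case of two distinct finite points. This is only an illustration; it does not handle doubling or the point at infinity, and it is not the operation-count-optimized version referenced below.

    # General-case addition in Jacobian coordinates (X, Y, Z), where the affine
    # point is (X/Z^2, Y/Z^3). Only additions and multiplications mod p.
    def jacobian_add(P1, P2, p):
        X1, Y1, Z1 = P1
        X2, Y2, Z2 = P2
        U1 = X1 * Z2 * Z2 % p
        U2 = X2 * Z1 * Z1 % p
        S1 = Y1 * Z2 * Z2 * Z2 % p
        S2 = Y2 * Z1 * Z1 * Z1 % p
        H = (U2 - U1) % p
        R = (S2 - S1) % p
        X3 = (R*R - H*H*H - 2*U1*H*H) % p
        Y3 = (R*(U1*H*H - X3) - S1*H*H*H) % p
        Z3 = H * Z1 * Z2 % p
        return (X3, Y3, Z3)

    # Same toy curve as before (y^2 = x^3 + 7 over GF(17)), with Z = 1:
    # (1, 5) + (2, 10) = (5, 9) in affine coordinates.
    assert jacobian_add((1, 5, 1), (2, 10, 1), 17) == (5, 9, 1)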

You can find much more on Jacobian coordinates, including explicit formulas and operation counts, here.

Complete formulas

Sometimes it’s possible to completely eliminate logic branches as well as division operations. An elliptic curve addition formula is called complete if it is valid for all inputs. The first surprise is that complete addition formulas sometimes exist. The second surprise is that complete addition formulas often exist.

You can find much more on complete addition formulas in the paper “Complete addition formulas for prime order elliptic curves” available here.

McCabe versus Kolmogorov

Whether the Jacobian coordinate formulas or complete formulas are more or less complex than direct implementation of the geometric definition depends on how you define complexity. The formulas are definitely less complex in terms of McCabe complexity, also known as cyclomatic complexity. But they are more complex in terms of Kolmogorov complexity.

The McCabe complexity of a function is essentially the number of independent paths through the function. A complete addition formula, one that could be implemented in software with no branching logic, has the smallest possible McCabe complexity.

I’m using the term Kolmogorov complexity to mean simply the amount of code it takes to implement a function. It’s intuitively clear what this means. But if you make it precise, you end up with a measure of complexity that is only useful in theory and cannot be calculated in practice.

Counting operations

Instead of literally computing Kolmogorov complexity you’d count the number of multiplications and additions necessary to execute a formula. The links above do just that. If you know how many cycles it takes to execute an addition or multiplication in the field you’re working over, and how many cycles it would take to carry out addition on the elliptic curve implemented directly from the definition, including the division required, then you could estimate whether these alternative formulas would save time.

It used to be common to count how many floating point operations it would take to execute an algorithm. I did that back when I took numerical analysis in college. But this became obsolete not long after I learned it. Counting floating point operations no longer tells you as much about an algorithm’s runtime as it used to due to changes in the speed of arithmetic relative to memory access and changes in computer architecture. However, in the context of this post, counting operations still matters because each operation—such as adding two 512-bit numbers modulo a 512-bit prime—involves a lot of more basic operations.


Elliptic curve Diffie-Hellman key exchange

I concluded the previous post by saying elliptic curve Diffie-Hellman key exchange (ECDHE) requires smaller keys than finite field Diffie-Hellman (FFDHE) to obtain the same level of security.

How much smaller are we talking about? According to NIST recommendations, a 256-bit elliptic curve provides about the same security as working over a 3072-bit finite field. Not only are elliptic curves smaller, they scale better. A 512-bit elliptic curve is believed to be about as secure as a 15360-bit finite field: a factor of 2x for elliptic curves and a factor of 5x for finite fields.

The core idea of Diffie-Hellman is to pick a group G, an element g, and a large number x. If y is the result of starting with g and applying the group operation x times, it is difficult to recover x from knowing y. This is called the discrete logarithm problem, taking its name from the case of the group operation being multiplication. But the inverse problem is still called the discrete logarithm problem when the group is additive.

In FFDHE the group G is the multiplicative group generated by g modulo a large prime p. Applying the group operation (i.e. multiplication) to g a total of x times is computing

y = g^x

and x is rightly called a discrete logarithm; the process is directly analogous to taking the logarithm of a real number.

In ECDHE the group is given by addition on an elliptic curve. Applying the group operation x times to g, adding g to itself x times, is written xg. The problem of recovering x from xg is still called the discrete logarithm problem, though you could call it the discrete “division” problem.

Some groups are unsuited for Diffie-Hellman cryptography because the discrete logarithm problem is easy. If we let G be the additive group modulo a prime (not the multiplicative group) then it is easy to recover x from xg.
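
As a toy illustration (made-up small numbers): in the additive group mod p, applying the group operation x times to g gives x·g mod p, and recovering x takes only one modular inversion.

    p = 101                               # toy prime
    g = 7
    x = 53                                # the secret number
    y = (x * g) % p                       # apply the group operation (addition) x times
    assert (y * pow(g, -1, p)) % p == x   # recovering x is just division mod p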

Note that when we talk about applying the group operation a large number of times, we mean a really large number of times, in theory, though not in practice. If you’re working over an elliptic curve with on the order of 2^256 elements, and x is on the order of 2^256, then xg is the result of adding g to itself on the order of 2^256 times. But in practice you’d double g on the order of 256 times. See fast exponentiation.
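
Here’s a sketch of the double-and-add idea for a generic group, written in terms of a placeholder add function and identity element; with an elliptic curve addition function plugged in, this computes xg in roughly log₂ x group operations.

    # Double-and-add: fast exponentiation written additively.
    def scalar_multiple(x, g, add, identity):
        result = identity
        base = g
        while x > 0:
            if x & 1:
                result = add(result, base)   # include this power-of-two multiple
            base = add(base, base)           # double
            x >>= 1
        return result

    # Sanity check in the additive group of integers mod 101.
    assert scalar_multiple(53, 7, lambda u, v: (u + v) % 101, 0) == (53 * 7) % 101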

In the post on FFDHE we said that you have to be careful that your choice of prime and generator doesn’t give the group a structure that a cryptanalyst could exploit. This is also true for the elliptic curves used in ECDHE, and even more so because elliptic curves are more subtle than finite fields.

If large-scale quantum computing ever becomes practical, Diffie-Hellman key exchange will be broken because a quantum computer can solve discrete logarithm problems efficiently via Shor’s algorithm. This applies equally to finite fields and elliptic curves.


Chinese Remainder Theorem synthesis algorithm

Suppose m = pq where p and q are large, distinct primes. In the previous post we said that calculations mod m can often be carried out more efficiently by working mod p and mod q, then combining the results to get back to a result mod m.

The Chinese Remainder Theorem assures us that the system of congruences

\begin{align*} x &\equiv a\pmod p \\ x &\equiv b\pmod q \end{align*}

has a unique solution mod m, but the theorem doesn’t say how to compute x efficiently.

H. L. Garner developed an algorithm to directly compute x [1]:

x = q \biggl(\big((a-b)(q^{-1}\bmod p)\big)\bmod p\biggr) + b

You compute the inverse of q mod p once and save it, then solving the system above for multiple values of a and b is very efficient.
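
Here’s a small Python sketch of the two-prime case, with toy numbers standing in for cryptographic ones.

    p, q = 101, 103               # toy primes
    q_inv = pow(q, -1, p)         # compute q^{-1} mod p once and save it

    def crt(a, b):
        # solve x = a mod p, x = b mod q using Garner's formula
        return q * (((a - b) * q_inv) % p) + b

    x = 2024
    assert crt(x % p, x % q) == x % (p * q)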

Garner’s algorithm extends to more than two factors. We will present the general case of his algorithm below, but first we do a concrete example with RSA keys.

RSA key example

This is a continuation of the example at the bottom of this post.

This shows that the numbers in the key file, beyond those strictly necessary for the RSA algorithm, are the numbers needed for Garner’s algorithm.

What the key file calls “coefficient” is the inverse of q modulo p.

What the key file calls “exponent1” is the decryption exponent d reduced mod p-1. Similarly, “exponent2” is d reduced mod q-1 as explained here.

    from sympy import lcm

    prime1 = 0xf33514...d9
    prime2 = 0xfee496...51

    publicExponent = 65537
    privateExponent = 0x03896d...91

    coefficient = 0x4d5a4c...b7 # q^-1 mod p
    assert(coefficient*prime2 % prime1 == 1)

    exponent1 = 0x37cc69...a1 # d mod (p-1)
    exponent2 = 0x2aa06f...01 # d mod (q-1)
    assert(privateExponent % (prime1 - 1) == exponent1)
    assert(privateExponent % (prime2 - 1) == exponent2)

    # d is the inverse of e mod lcm(p-1, q-1)
    assert(publicExponent*privateExponent % lcm(prime1 - 1, prime2 - 1) == 1)

More factors

Garner’s algorithm can be used more generally when m is the product of more than two primes [2]. Suppose

m = \prod_{i=1}^n m_i

where the m_i are pairwise relatively prime (not necessarily prime). Then the system of congruences

x \equiv x_i \pmod {m_i}

for i = 1, 2, 3, …, n can be solved by looking for a solution of the form

x = v_1 + v_2 m_1 + v_3 m_1m_2 + \cdots + v_n \prod_{i=1}^{n-1} m_i

where

v_k \equiv \bigg( x_k - \big(v_1 + v_2m_1 + \cdots + v_{k-1} \prod_{i=1}^{k-2}m_i \big) \bigg) \bigg( \prod_{i=1}^{k-1} m_i\bigg)^{-1} \pmod {m_k}

Again, in practice the modular inverses of the products of the ms would be precomputed and cached.
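
Here is a straightforward Python sketch of the general algorithm. For clarity it computes the modular inverses inline; as noted above, in practice they would be precomputed and cached.

    def garner(residues, moduli):
        # moduli must be pairwise relatively prime; residues[i] = x mod moduli[i]
        v = []
        for k, (x_k, m_k) in enumerate(zip(residues, moduli)):
            t = 0          # partial sum v_1 + v_2 m_1 + ... evaluated mod m_k
            prod = 1       # running product m_1 m_2 ... mod m_k
            for j in range(k):
                t = (t + v[j] * prod) % m_k
                prod = (prod * moduli[j]) % m_k
            v.append((x_k - t) * pow(prod, -1, m_k) % m_k)
        # assemble x = v_1 + v_2 m_1 + v_3 m_1 m_2 + ...
        x, prod = 0, 1
        for v_k, m_k in zip(v, moduli):
            x += v_k * prod
            prod *= m_k
        return x

    moduli = [5, 7, 9, 11]   # pairwise relatively prime, not all prime
    assert garner([1234 % m for m in moduli], moduli) == 1234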


[1] Ferguson, Schneier, Kohno. Cryptography Engineering. Wiley. 2010.

[2] Geddes, Czapor, and Labahn. Algorithms for Computer Algebra. Kluwer Academic Publishers. 1992.

Checksum polynomials

A large class of checksum algorithms have the following pattern:

  1. Think of the bits in a file as the coefficients in a polynomial P(x).
  2. Divide P(x) by a fixed polynomial Q(x) mod 2 and keep the remainder.
  3. Report the remainder as a sequence of bits.

In practice there’s a little more to the algorithm than this, such as appending the length of the file, but the above pattern is at the heart of the algorithm.
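
Here is a bare-bones Python sketch of the pattern, representing a polynomial over GF(2) as an integer whose bit k is the coefficient of x^k. It implements only the division step and omits the refinements (bit ordering, initial values, appended length) that real CRC implementations add.

    def poly_remainder_mod2(P, Q):
        # Remainder of P(x) divided by Q(x), coefficients mod 2.
        deg_Q = Q.bit_length() - 1
        while P.bit_length() > deg_Q:
            shift = (P.bit_length() - 1) - deg_Q
            P ^= Q << shift          # subtracting mod 2 is XOR
        return P

    # Example: file contents as a polynomial, reduced mod a checksum polynomial Q.
    P = int.from_bytes(b"hello, world", "big")
    Q = 0b10011                      # stands for x^4 + x + 1, a placeholder choice
    print(bin(poly_remainder_mod2(P, Q)))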

There’s a common misconception that the polynomial Q(x) is irreducible, i.e. cannot be factored. This may or may not be the case.

CRC-32

Perhaps the most common choice of Q is

Q(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^3 + x^2 + x + 1

This polynomial is used in the cksum utility and is part of numerous standards. It’s known as the CRC-32 polynomial, though there are other polynomials occasionally used in 32-bit implementations of the CRC algorithm. And it is far from irreducible as the following Mathematica code shows. The command

    Factor[x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + 
           x^11 + x^10 + x^8 +  x^7 + x^5 + x^4 + 
           x^3 + x^2 + x + 1, Modulus -> 2]

shows that Q can be factored as

(1 + x)^5 (1 + x + x^3 + x^4 + x^6) (1 + x + x^2 + x^5 + x^6)
(1 + x + x^4 + x^6 + x^7) (1 + x + x^4 + x^5 + x^6 + x^7 + x^8)

(Mathematica displays polynomials in increasing order of terms.)

Note that the factorization is valid when done over the field with 2 elements, GF(2). Whether a polynomial can be factored, and what the factors are, depends on what field you do your arithmetic in. The polynomial Q(x) above is irreducible as a polynomial with rational coefficients. It can be factored working mod 3, for example, but it factors differently mod 3 than it factors mod 2. Here’s the factorization mod 3:

(1 + 2 x^2 + 2 x^3 + x^4 + x^5) (2 + x + 2 x^2 + x^3 + 2 x^4 + x^6 + x^7)
(2 + x + x^3 + 2 x^7 + x^8 + x^9 + x^10 + 2 x^12 + x^13 + x^15 + 2 x^16 + x^17 + x^18 + x^19 + x^20)
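
If you’d rather not use Mathematica, here’s a sketch of the same check in Python with SymPy, whose factor function also accepts a modulus argument.

    from sympy import symbols, factor

    x = symbols('x')
    Q = sum(x**k for k in (32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 3, 2, 1, 0))

    print(factor(Q, modulus=2))   # factorization over GF(2)
    print(factor(Q, modulus=3))   # factorization mod 3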

CRC-64

The polynomial

Q(x) = x^64 + x^4 + x^3 + x + 1

is known as CRC-64, and is part of several standards, including ISO 3309. This polynomial is irreducible mod 2 as the following Mathematica code confirms.

    IrreduciblePolynomialQ[x^64 + x^4 + x^3 + x + 1, Modulus -> 2]

The CRC algorithm uses this polynomial mod 2, but out of curiosity I checked whether it is irreducible in other contexts. The following code tests whether the polynomial is irreducible modulo the first 100 primes.

    Table[IrreduciblePolynomialQ[x^64 + x^4 + x^3 + x + 1, 
        Modulus -> Prime[n]], {n, 1, 100}]

It is irreducible mod p for p = 2, 233, or 383, but not for any other primes up to 541. It’s also irreducible over the rational numbers.

Since Q is irreducible mod 2, the checksum essentially views its input P(x) as a member of the finite field GF(2^64).


Angles between words

Natural language processing represents words as high-dimensional vectors, on the order of 100 dimensions. For example, the glove-wiki-gigaword-50 set of word vectors contains 50-dimensional vectors, and the glove-wiki-gigaword-200 set of word vectors contains 200-dimensional vectors.

The intent is to represent words in such a way that the angle between vectors is related to similarity between words. Closely related words would be represented by vectors that are close to parallel. On the other hand, words that are unrelated should have large angles between them. The metaphor of two independent things being orthogonal holds almost literally as we’ll illustrate below.

Cosine similarity

For vectors x and y in two dimensions,

\mathbf{x} \cdot \mathbf{y} = ||\mathbf{x} || \,||\mathbf{y} || \, \cos(\theta)

where θ is the angle between the vectors. In higher dimensions, this relation defines the angle θ in terms of the dot product and norms:

\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{ ||\mathbf{x} || \,||\mathbf{y} || }

The right-hand side of this equation is the cosine similarity of x and y. NLP usually speaks of cosine similarity rather than θ, but you could always take the inverse cosine of cosine similarity to compute θ. Note that cos(0) = 1, so small angles correspond to large cosines.

Examples

For our examples we’ll use gensim with word vectors from the glove-twitter-200 model. As the name implies, this data set maps words to 200-dimensional vectors.

Note that word embeddings differ in the data they were trained on and the algorithm used to produce the vectors. The examples below could be very different using a different source of word vectors.

First some setup code.

    
    import numpy as np
    import gensim.downloader as api
    word_vectors = api.load("glove-twitter-200")

    def norm(word):
        v = word_vectors[word]
        return np.dot(v, v)**0.5

    def cosinesim(word0, word1):
        v = word_vectors[word0]
        w = word_vectors[word1]
        return np.dot(v, w)/(norm(word0)*norm(word1))

Using this model, the cosine similarity between “dog” and “cat” is 0.832, which corresponds to about a 34° angle. The cosine similarity between “dog” and “wrench” is 0.145, which corresponds to an angle of 82°. A dog is more like a cat than like a wrench.

The similarity between “dog” and “leash” is 0.487, not because a dog is like a leash, but because the word “leash” is often used in the same context as the word “dog.” The similarity between “cat” and “leash” is only 0.328 because people speaking of leashes are more likely to also be speaking about a dog than a cat.

The cosine similarity between “uranium” and “walnut” is only 0.0054, corresponding to an angle of 89.7°. The vectors associated with the two words are very nearly orthogonal because the words are orthogonal in the metaphorical sense.

Note that opposites are somewhat similar. Uranium is not the opposite of walnut because things have to have something in common to be opposites. The cosine similarity of “expensive” and “cheap” is 0.706. Both words are adjectives describing prices and so in some sense they’re similar, though they have opposite valence. “Expensive” has more in common with “cheap” than with “pumpkin” (similarity 0.192).

The similarity between “admiral” and “general” is 0.305, maybe less than you’d expect. But the word “general” is kinda general: it can be used in more contexts than military office. If you add the vectors for “army” and “general”, you get a vector that has cosine similarity 0.410 with “admiral.”
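
Here’s a sketch of that last calculation, reusing the setup code above. The exact numbers depend on the model, as noted earlier.

    # Add the vectors for "army" and "general", then compare to "admiral".
    v = word_vectors["army"] + word_vectors["general"]
    w = word_vectors["admiral"]
    cs = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    print(cs)                             # roughly 0.41 with this model
    print(np.degrees(np.arccos(cs)))      # the corresponding angle in degrees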


Productive constraints

This post will discuss two scripting languages, but that’s not what the post is really about. It’s really about expressiveness and (or versus) productivity.

***

I was excited to discover the awk programming language sometime in college because I had not used a scripting language before. Compared to C, awk was high-level luxury.

Then a few weeks later someone said “Have you seen Perl? It can do everything awk can do and a lot more.” So I learned Perl. Was that a good idea or a bad idea? I’ve been wondering about that for years.

Awk versus Perl is a metaphor for a lot of other decisions.

***

Awk is a very small language, best suited for working with tabular data files. Awk implicitly loops over a file, repeating some code on every line of a file. This makes it possible to write very short programs, programs so short that they can be typed at the command line, for doing common tasks. I am continually impressed by bits of awk code I see here and there, where someone has found a short, elegant solution to a problem.

Because awk is small and specialized, it is also efficient at solving the problems it is designed to solve. The previous post gives an example.

The flip side of awk being small and specialized is that it can be awkward to use for problems that deviate from its sweet spot.

***

Perl is a very expressive programming language and is suitable for a larger class of problems than awk is. Awk was one of the influences in the design of Perl, and you can program in an awk-like subset of Perl. So why not give yourself more options and write Perl instead?

Expressiveness is mostly good. Nobody is forcing you to use any features you don’t want to use and it’s nice to have options. But expressiveness isn’t a free option. I’ll mention three costs.

  1. You might accidentally use a feature that you don’t intend to use, and rather than getting an error message you get unexpected behavior. This is not a hypothetical risk but a common experience.
  2. If you have more options, so does everyone writing code that you might need to debug or maintain. “Expressiveness for me but not for thee.”
  3. More options means more time spent debating options. Having too many options dissipates your energy.

***

You can mitigate #1 by turning on warnings available in later versions of Perl. And you can mitigate #2 and #3 by deciding which language features you (or your team) will use and which features you will avoid.

But if you use a less expressive language, these decisions have been made for you. No need to decide on and enforce rules on features to shun. Avoiding decision fatigue is great, if you can live with the decisions that have been made for you.

The Python community has flourished in part because the people who don’t like the language’s design choices would rather live with those choices than leave these issues permanently unsettled.

***

Bruce Lee famously said “I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.” You could apply that aphorism by choosing to master a small language like awk, learning not just its syntax but its idioms, and learning it so well that you never need to consult documentation.

Some people have done this, mastering awk and a few other utilities. They write brief scripts that do tasks that seem like they would require far more code. I look at these scripts expecting to see utilities or features that I didn’t know about, but usually these people have combined familiar pieces in a clever way.

***

Some people like to write haiku and some free verse. Hedgehogs and foxes. Scheme and Common Lisp. Birds and Frogs. Awk and Perl. So the optimal size of your toolbox is to some extent a matter of personality. But it’s also a matter of tasks and circumstances. There are no solutions, only trade-offs.

Sort and remove duplicates

A common idiom in command line processing of text files is

    ... | sort | uniq | ...

Some process produces lines of text. You want to pipe that text through sort to sort the lines in alphabetical order, then pass it to uniq to filter out all but the unique lines. The uniq utility only removes adjacent duplicates, and so it will not remove all duplicates unless the input is sorted. (At least the duplicate lines need to be grouped together; the groups need not be in any particular order.)

When given the -u flag, sort will sort and remove duplicates. This says the idiom above could be shortened to

    ... | sort -u | ...

Aside from saving a few keystrokes, is there any advantage to the latter? There could be, depending on how sort -u is implemented. If internally it simply sorts its input and then removes duplicates, then there is no advantage. But if the code simultaneously sorts and removes duplicates, it could save memory and time, depending on the input. If the code discarded duplicates as they were encountered, the code would need working memory proportional to the amount of unique input rather than the total amount of input.

I had a project recently that makes a good test case for this. The Gutenberg text corpus contains a list of words for 55,000 different sources, each in a separate text file. There are a lot of files, and there is a lot of redundancy between files. The combined file is 3.4 GB.

Running sort | uniq on the combined file took 610 seconds.

Running sort -u on the file took 394 seconds.

So in this example, sort -u not only saved a few keystrokes, it took about 35% off the time.

I expected it would save even more time. Maybe a custom program optimized for large files with a lot of redundancy could be even faster.
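
For instance, here’s a minimal Python sketch of such a program (file names are placeholders): it keeps a set of the lines seen so far, so its working memory is proportional to the unique output rather than the full input. Whether it would actually beat sort -u on a 3.4 GB file is an empirical question.

    def sort_unique(infile, outfile):
        seen = set()                      # memory proportional to unique lines
        with open(infile) as f:
            for line in f:
                seen.add(line)
        with open(outfile, "w") as f:
            f.writelines(sorted(seen))

    sort_unique("combined.txt", "out.txt")   # placeholder file names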

Update: awk

Bob Lyon’s comment made me think about using awk without concatenating the files together. Each of the 55,000 text files contains a word and a count. I didn’t concatenate the files per se but piped each through cut -f 1 to select the first column.

Using awk I created an associative array (a hash table, but awk calls them associative arrays) with an entry for each word, then printed out the keys of the array.

    awk '{count[$1]++}; 
        END {for (key in count) {print key}}' | 
        sort > out

This ran in 254 seconds, 58% faster than sort | uniq and 36% faster than sort -u.

There is a lot of redundancy between the files—the list of unique words is 320 times smaller than the union of all the input files—and the awk script takes advantage of this by maintaining an array roughly the size of the output rather than the size of the input.

Tim Chase suggested a more elegant solution:

    awk '!count[$1]++ {print $1}' *.txt | sort > out

It seems this should be more efficient as well since awk only goes through the data once, immediately printing new values as they are encountered. However, it actually took slightly longer, 263 seconds.

Swish, mish, and serf

Swish, mish, and serf are neural net activation functions. The names are fun to say, but more importantly the functions have been shown to improve neural network performance by solving the “dying ReLU problem.” This happens when a large number of node weights become zero during training and do not contribute further to the learning process. The weights may be “trying” to be negative, but the activation function won’t allow it. The swish family of activation functions allows weights to dip negative.

Softplus can also be used as an activation function, but our interest in softplus here is as part of the definition of mish and serf. Similarly, we’re interested in the sigmoid function here because it’s used in the definition of swish.

Softplus, softmax, softsign

Numerous functions in deep learning are named “soft” this or that. The general idea is to “soften” a non-differentiable function by approximating it with a smooth function.

The plus function is defined by

x^+ = \left\{ \begin{array}{ll} x & \mbox{if } x \geq 0 \\ 0 & \mbox{if } x < 0 \end{array} \right.

In machine learning the plus function is called ReLU (rectified linear unit), but x^+ is conventional mathematical notation.

The softplus function approximates x^+ by

\text{softplus}(x) = \log(1 + \exp(x))

We can add a parameter to the softplus function to control how soft it is.

\text{softplus}(x; k) = \frac{1}{k}\log\left(1 + \exp(kx)\right)

As the parameter k increases, softplus becomes “harder.” It more closely approximates x^+, but at the cost of its derivative changing more abruptly near 0.

We won’t need other “soft” functions for the rest of the post, but I’ll mention a couple more while we’re here. Softmax is a smooth approximation to the max function

\text{softmax}(x, y) = \log\left(\exp(x) + \exp(y)\right)

and softsign is a smooth approximation to the sign function.

\text{softsign}(x) = \frac{x}{1 + |x|}

which is differentiable at the origin, despite the absolute value in the denominator. We could add a sharpness parameter k to the softmax and softsign functions like we did above.

Sigmoid

The sigmoid function, a.k.a. the logistic curve, is defined by

\sigma(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)}

The second and third expressions are clearly equal. I prefer to think in terms of the former, but the latter is better for numerical calculation because it won’t overflow for large x.

The sigmoid function is the derivative of the softplus function. We could define a sigmoid function with a sharpness parameter k by taking the derivative of the softplus function with sharpness parameter k.

Swish

The Swish function was defined in [1] as

\text{swish}(x) = x \, \sigma(x)

Like softplus, the Swish function is asymptotically 0 as x → −∞ and asymptotically equal to x as x → ∞. That is, for large |x|, swish(x) is approximately x^+. However, unlike softplus, swish(x) is not monotone. It is negative for negative x, having a minimum around -1.278 [2]. The fact that the activation function can dip into negative territory is what addresses the “dying ReLU problem.”

Mish

The Mish function was defined in [3] as

\text{mish}(x) = x \, \tanh(\,\text{softplus}(x)\,)

The mish function has a lot in common with the swish function, but performs better, at least on the examples the author presents.

The mish function is said to be part of the swish function family, as is the serf function below.

Serf

The serf function [4] is another variation on the swish function with similar advantages.

\text{serf}(x) = x \, \text{erf}(\,\text{softplus}(x)\,)

The name “serf” comes from “log-Softplus ERror activation Function” but I’d rather pronounce it “zerf” from “x erf” at the beginning of the definition.

The serf function is hard to distinguish visually from the mish function; I left it out of the plot above to keep the plot from being cluttered. Despite the serf and mish functions being close together, the authors of [4] say the former outperforms the latter on some problems.
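
For reference, here’s a direct NumPy/SciPy sketch of the functions above, using the straightforward formulas. (See the section on numerical evaluation below before using these on large inputs.)

    import numpy as np
    from scipy.special import erf

    def softplus(x):
        return np.log(1 + np.exp(x))   # naive formula; can overflow for large x

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def swish(x):
        return x * sigmoid(x)

    def mish(x):
        return x * np.tanh(softplus(x))

    def serf(x):
        return x * erf(softplus(x))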

Numerical evaluation

The most direct implementations of the softplus function can overflow, especially when implemented in low-precision floating point such as 16-bit floats. This means that the mish and serf functions could also overflow.

For large x, exp(x) could overflow, and so computing softplus as

   log(1 + exp(x))

could overflow while the exact value would be well within the range of representable numbers. A simple way to fix this would be to have the softplus function return x when x is so large that exp(x) would overflow.

As mentioned above, the expression for the sigmoid function with exp(-x) in the denominator is better for numerical work than the version with exp(x) in the numerator.
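
Here’s one way to carry this out in NumPy, a sketch rather than production code. Instead of an explicit threshold test, np.logaddexp computes log(exp(a) + exp(b)) without overflowing, and the sigmoid can be written so that exp is only ever evaluated at non-positive arguments.

    import numpy as np

    def softplus_stable(x):
        return np.logaddexp(0, x)      # log(1 + exp(x)) without overflow

    def sigmoid_stable(x):
        e = np.exp(-np.abs(x))         # exponent is never positive
        return np.where(x >= 0, 1 / (1 + e), e / (1 + e))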


[1] Swish: A Self-Gated Activation Function. Prajit Ramachandran, Barret Zoph, Quoc V. Le. arXiv:1710.05941v1

[2] Exact form of the minimum location and value in terms of the Lambert W function discussed in the next post.

[3] Mish: A Self Regularized Non-Monotonic Activation Function. Diganta Misra. arXiv:1908.08681v3

[4] SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function. Sayan Nag, Mayukh Bhattacharyya. arXiv:2108.09598v3

Generating and inspecting an RSA private key

In principle you generate an RSA key by finding two large prime numbers, p and q, and computing n = pq. You could, for example, generate random numbers by rolling dice, then type the numbers into Mathematica to test each for primality until you find a couple prime numbers of the right size.

In practice you’d use a specialized program to find the primes and to wrap everything up in a format that software using the keys can understand. There are a lot of layers between the numbers p and q and the file that key generating software produces, and this post aims to peel back these layers a bit.

Here’s an example of generating a private key taken from The OpenSSL Cookbook.

    openssl genpkey -out fd.key -algorithm RSA \
      -pkeyopt rsa_keygen_bits:2048 -aes-128-cbc

The genpkey function can be used for generating several kinds of public keys. The option -algorithm RSA tells it that we want an RSA key, but we could have asked for an elliptic curve key. As noted in the previous post, in practice public key encryption is used to transfer symmetric encryption keys, not messages per se. The flag -aes-128-cbc tells the software that we’d like to use AES encryption with a 128-bit key in CBC (cipher block chaining) mode.

When you press enter you’ll see a flurry of dots and plus signs that show the progress of the software in generating and testing candidates for the primes p and q. Then you’ll be prompted for a password to encrypt the private key you’ve just created.

If you open the fd.key file you won’t see much:

    % cat fd.key
    -----BEGIN ENCRYPTED PRIVATE KEY-----
    MIIFLTBXBgkqhkiG9w0BBQ0wSjApBgkqhkiG9w0BBQwwHAQIdCZSKfkqh6kCAggA
    MAwGCCqGSIb3DQIJBQAwHQYJYIZIAWUDBAECBBAqbtHXkZ+uqa3rvj6qKqbRBIIE
    ...
    U6QCPcWukFyUAghHdTfjKgoAEXfOEunALoaTF6LMPsd6
    -----END ENCRYPTED PRIVATE KEY-----

This is just base 64-encoded data.

The data is encoded in two senses. It is encoded in a non-secret way, expressed in a standardized data structure, then encoded in the sense of being encrypted. The openssl command pkey will undo both levels of encoding to let us see the contents of the file.

Here’s what this produces.

    Private-Key: (2048 bit, 2 primes)
    modulus:
        00:a7:b8:39:80:0b:18:d9:db:c1:a3:c1:3a:92:89:
        ...
        7a:c5
    publicExponent: 65537 (0x10001)
    ...
    prime1:
        00:dc:8c:27:e6:7f:1c:11:d4:9c:8c:33:bf:07:57:
        ...
        97:5f:8c:4c:44:23:d2:85:f9
    prime2:
        00:c2:ae:20:80:87:da:d0:a1:66:8f:2e:90:7c:ae:
        ...
        9c:e9:8a:8b:bc:c7:71:de:2d
    ...

The exponent is the default value 65537. (More on that here.)

The large numbers are displayed in hexadecimal with colons separating pairs of hex digits. If you remove the colons and concatenate everything together, you can verify that the number called modulus is indeed the product of the numbers called prime1 and prime2. I verified this for the output above using a little Python code:

    modulus = 0xa7b839...c5
    prime1  = 0xdc8c27...f9
    prime2  = 0xc2ae20...2d
    assert(prime1*prime2 == modulus)

The file also contains four numbers that require more explanation: privateExponent, exponent1, exponent2, and coefficient. The privateExponent is described here. The remaining numbers are not strictly necessary for RSA but are used in Garner’s algorithm for more efficient decryption.
