Comparing bfloat16 range and precision to other 16-bit numbers

server rack

Deep learning has spurred interest in novel floating point formats. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these can benefits can give significant speedup.

Here I want to look at bfloat16, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

    | Format | Bits | Exponent | Fraction |
    | FP32   |   32 |        8 |       23 |
    | FP16   |   16 |        5 |       10 |
    | BF16   |   16 |        8 |        7 |

BF16 has as many bits as a FP16, but as many exponent bits as a FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.


The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2e where e is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

    | Format |    Epsilon |
    | FP32   | 0.00000012 |
    | FP16   | 0.00390625 |
    | BF16   | 0.03125000 |

Dynamic range

The dynamic range of bfloat16 is similar to that of a IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

    | Format | DR    |
    | FP32   | 83.38 |
    | BF16   | 78.57 |
    | FP16   | 12.04 |

Comparison to posits

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted es by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an n-bit posit, the number of fraction bits near 1 is n – 2 – es and so epsilon is 2 to the exponent es – n – 2. The dynamic range is

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

which is derived here. The dynamic range and epsilon values for 16-bit posits with es ranging from 1 to 4 are given in the table below.

    | es |     DR |   epsilon |
    |  1 |  16.86 | 0.0000076 |
    |  2 |  33.82 | 0.0000153 |
    |  3 |  37.43 | 0.0000305 |
    |  4 | 143.86 | 0.0000610 |

For all the values of es above, a 16-bit posit number has a smaller epsilon than either FP16 or BF16. The dynamic range of a 16-bit posit is larger than that of a FP16 for all values of es, and greater than BF16 and FP32 when es = 4.

Related posts

Why “work smarter, not harder” bothers me

welder working hard

One of my most popular posts on Twitter was an implicit criticism of the cliché “work smarter, not harder.”

I agree with the idea that you can often be more productive by stepping back and thinking about what you’re doing. I’ve written before, for example, that programmers need to spend less time in front of a computer.

But one thing I don’t like about “work smarter” is the implication that being smart eliminates the need to work hard. It’s like a form of gnosticism.

Also, “working smarter” is kind of a given. People don’t often say “I know of a smarter way to do this, but I prefer working hard at the dumb way.” [1] Instead, they’re being as smart as they know how, or at least they think they are. To suggest otherwise is to insult their intelligence.

One way to “work smarter, not harder” is to take good advice. This is different from “working smarter” in the sense of thinking alone in an empty room, waiting for a flash of insight. Maybe you’re doing what you’re doing as well as you can, but you should be doing something else. Maybe you’re cleverly doing something that doesn’t need to be done.

Related links

[1] If they do, they’re still being smart at a different level. Someone might think “Yeah, I know of a way to do this that would be more impressive. But I’m going to take a more brute-force approach that’s more likely to be correct.” Or they might think “I could imagine a faster way to do this, but I’m too tired right now to do that.” They’re still being optimal, but they’re including more factors in the objective they’re trying to optimize.

New expansions of confluent hypergeometric function

Last week Bujanda et al published a paper [1] that gives new expansions for the confluent hypergeometric function. I’ll back up explain what that means before saying more about the new paper.

Hypergeometric functions

Hypergeometric functions are something of a “grand unified theory” of special functions. Many functions that come up in application are special cases of hypergeometric function, as shown in the diagram below. (Larger version here.)

special function relationships

I give a brief introduction to hypergeometric functions in these notes (4-page pdf).

Confluent hypergeometric functions

The confluent hypergeometric function corresponds to Hypergeometric 1F1 in the chart above [2]. In Mathematica it goes by Hypergeometric1F1 and in Python goes by scipy.special.hyp1f1.

This function is important in probability because you can use it to compute the CDFs for the normal and gamma distributions.

New expansions

Bujanda et al give series expansions of the confluent hypergeometric functions in terms of incomplete gamma functions. That may not sound like progress because the incomplete gamma functions are still special functions, but the confluent hypergeometric function is a function of three variables M(ab; z) whereas the incomplete gamma function γ(sz) is a function of two variables.

They also give expansions in terms of elementary functions which may be of more interest. The authors give the series expansion below.

M_n(a, b; z) = \frac{\Gamma(b)}{\Gamma(a)\,\Gamma(b-a)} \sum_{k=0}^{n-1} A_k(a, b)\, F_k(z)

The A functions are given recursively by


A_n(a, b) = \frac{2}{n} \left( (2a - b)A_{n-1}(a, b) + 2(n-b)A_{n-2}(a,b) \right) 

for n > 1.

The F functions are given by

F_n(z) = \frac{n!}{(-z)^{n+1}} \Big( e_n(z/2) - \exp(z) e_n(-z/2) \Big)


e_n(z) = \sum_{k=0}^n \frac{z^k}{k!}

The authors give approximation error bounds in [1]. In the plot below we show that n = 3 makes a good approximation for M(3, 4.1, z). The error converges to zero uniformly as n increases.

Related posts

[1] Blanca Bujanda, José L. López, and Pedro J. Pagola. Convergence expansions of the confluent hypergeometric functions in terms of elementary functions. Mathematics of computation. DOI

[2] “Confluent” is a slightly archaic name, predating a more systematic naming for hypergeometric functions. The name mFn means the hypergeometric function has m upper parameters and n lower parameters. The confluent hypergeometric function has one of each, so it corresponds to 1F1. For more details, see these notes.

Big data and privacy

patient survey

How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?

Rows vs Columns

People commonly speak of big data in terms of volume—the “four v’s” of big data being volume, variety, velocity, and veracity—but what we’re concerned with here might better be called “area.” We’ll think of our data being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.

In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?

All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.

How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.

The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to indentify you, but easier to reveal information about you.

Group privacy

For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you’re a Native American man in your 40’s and there are 49 others like you. Then someone who knows your demographics can’t be sure which record is yours; they’d have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data show that middle aged Native American men are highly susceptible to some disease, then the data imply that you are likely to be susceptible to that disease.

Because privacy measures aimed at protecting individuals don’t necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it’s quite understandable that minority groups might still be reluctant to participate in studies.

Improving privacy with size

Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not effect statistical conclusions from the data but will protect individual privacy.

With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.

If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there’s no need for a trusted data curator. This wouldn’t be possible with a small dataset because the noise would be too large relative to the size of the data.

Related posts

What is proof-of-work?

The idea of proof of work (PoW) was first explained in a paper Cynthia Dwork and Moni Naor [1], though the term “proof of work” came later [2]. It was first proposed as a way to deter spam, but it’s better known these days through its association with cryptocurrency.

If it cost more to send email, even a fraction of a cent per message, that could be enough to deter spammers. So suppose you want to charge anyone $0.005 to send you an email message. You don’t actually want to collect their money, you just want proof that they’d be willing to spend something to email you. You’re not even trying to block robots, you just want to block cheap robots.

So instead of asking for a micropayment, you could ask the sender to solve a puzzle, something that would require around $0.005 worth of computing resources. If you’re still getting too much spam, you could increase your rate and by giving them a harder puzzle.

Dwork and Naor list several possible puzzles. The key is to find a puzzle that takes a fair amount of effort to solve but the solution is easy to verify.

Bitcoin uses hash problems for proof-of-work puzzles. Cryptographic hash functions are difficult to predict, and so you can’t do much better than brute force search if you want to come up with input whose hashed value has a specified pattern.

The goal is to add a fixed amount of additional text to a message such that when the hash function is applied, the resulting value is in some narrow range, such as requiring the first n bits to be zeros. The number n could be adjusted over time as needed to calibrate the problem difficulty. Verifying the solution requires computing only one hash, but finding the solution requires computing 2n hashes on average.

Related posts

[1] Cynthia Dwork and Noni Naor (1993). “Pricing via Processing, Or, Combatting Junk Mail, Advances in Cryptology”. CRYPTO’92: Lecture Notes in Computer Science No. 740. Springer: 139–147.

[2] Markus Jakobsson and Ari Juels (1999). “Proofs of Work and Bread Pudding Protocols”. Communications and Multimedia Security. Kluwer Academic Publishers: 258–272.

The acoustics of Hagia Sophia

Hagia Sophia

The Hagia Sophia (Greek for “Holy Wisdom”) was a Greek Orthodox cathedral from 537 to 1453. When the Ottoman Empire conquered Constantinople the church was converted into a mosque. Then in 1935 it was converted into a museum.

No musical performances are allowed in the Hagia Sophia. However, researchers from Stanford have modeled the acoustics of the space in order to simulate what worship would have sounded like when it was a medieval cathedral. The researchers recorded a virtual performance by synthesizing the acoustics of the building. Not only did they post-process the sound to give the singers the sound of being in the Hagia Sophia, they first gave the singers real-time feedback so they would sing as if they were there.

Related posts

Gamma gamma gamma!

There are several things in math and statistics named gamma. Three examples are

  • the gamma function
  • the gamma constant
  • the gamma distribution

This post will show how these are related. We’ll also look at the incomplete gamma function which connects with all the above.

The gamma function

The gamma function is the most important function not usually found on a calculator. It’s the first “advanced” function you’re likely to learn about. You might see it in passing in a calculus class, in a homework problem on integration by parts, but usually not there’s not much emphasis on it. But it comes up a lot in application.

\Gamma(x) = \int_0^\infty t^{x-1 } e^{-t}\, dt

You can think of the gamma function as a way to extend factorial to non-integer values. For non-negative integers n, Γ(n + 1) = n!.

(Why is the argument n + 1 to Γ rather than n? There are a number of reasons, historical and practical. Short answer: some formulas turn out simpler if we define Γ that way.)

The gamma constant

The gamma constant, a.k.a. Euler’s constant or the Euler-Mascheroni constant, is defined as the asymptotic difference between harmonic numbers and logarithms. That is,

\gamma = \lim_{n\to\infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}- \log n\right)

The constant γ comes up fairly often in applications. But what does it have to do with the gamma function? There’s a reason the constant and the function are both named by the same Greek letter. One is that the gamma constant is part of the product formula for the gamma function.

\frac{1}{\Gamma(x)} = x e^{\gamma x} \prod_{r=1}^\infty \left(1 + \frac{x}{r} \right) e^{-x/r}

If we take the logarithm of this formula and differentiation we find out that

\frac{\Gamma'(1)}{\Gamma(1)} = -\gamma

The gamma distribution

If you take the integrand defining the gamma function and turn it into a probability distribution by normalizing it to integrate to 1, you get the gamma distribution. That is, a gamma random variable with shape k has probability density function (PDF) given by

f(x) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}

More generally you could add a scaling parameter to the gamma distribution in the usual way. You could imaging the scaling parameter present here but set to 1 to make things simpler.

The incomplete gamma function

The incomplete gamma function relates to everything above. It’s like the (complete) gamma function, except the range of integration is finite. So it’s now a function of two variables, the extra variable being the limit of integration.

\gamma(s, x)= \int_0^x t^{s-1} e^{-t} \, dt

(Note that now x appears in the limit of integration, not the exponent of t. This notation is inconsistent with the definition of the (complete) gamma function but it’s conventional.)

It uses a lower case gamma for its notation, like the gamma constant, and is a generalization of the gamma function. It’s also essentially the cumulative distribution function of the gamma distribution. That is, the CDF of a gamma random variable with shape s is γ(sx) / Γ(s).

The function γ(sx) / Γ(s) is called the regularized incomplete gamma function. Sometimes the distinction between the regularized and unregularized versions is not explicit. For example, in Python, the function gammainc does not compute the incomplete gamma function per se but the regularized incomplete gamma function. This makes sense because the latter is often more convenient to work with numerically.

Related posts

East Coast Code and West Coast Code

In 1999, Harvard law professor Lawrence Lessig stated that software is a kind of law. The algorithms of tech giants have an influence comparable to, and maybe in some ways greater than, statutory law.

The code is law. … Activists concerned with defending liberty, privacy or access must watch the code coming from the Valley—call it West Coast Code—as much as the code coming from Congress—call it East Coast Code.

I imagine more people would agree with him now than did when he said this a couple decades ago.

If you can call statutory law “code” then you could also call algorithms “law” and speak of East Coast Law and West Coast Law.

Source: The Architecture of Privacy

What is differential privacy?

Differential privacy is a strong form of privacy protection with a solid mathematical definition.

Roughly speaking, a query is differentially private if it makes little difference whether your information is included or not. This intuitive idea can be made precise as follows.

blurry crowd

Queries and algorithms

First of a differential privacy is something that applies to queries, not to databases. It could apply to a database if your query is to select everything in the database, but typically you want to run far more specific queries. Differential privacy adds noise to protect privacy, and the broader the query, the more noise must be added, so typically you want queries to be more narrow than “Tell me everything about everything.”

Generally the differential privacy literature speaks of algorithms rather than queries. The two are the same if you have a general idea of what a query is, but using the word algorithm emphasizes that the query need not be a simple SQL query.

Opt-out and neighboring databases

To quantify the idea of “whether your information is included or not” we look at two versions of a database, D and D‘, that differ by one row, which you can think of as the row containing your data. Our formalism is symmetric, so we do not specify which database, D or D‘, is the one which contains your row and which is missing your row.

We should mention some fine print here. The paragraph above implicitly assumes that your database is just a simple table. In general we can speak of neighboring databases. In the case of a more complex database design, the notion of what it means for two databases to be neighboring is more complex. Differential privacy can be defined whenever a notion of neighboring databases can be defined, so you could, for example, consider differential privacy for a non-relational (“No SQL”) database. But for the simple case of a single table, two databases are neighboring if they differ by a row.

However, there’s a subtlety regarding what it means even for two tables to “differ by one row.” Unbound differential privacy assumes one row has been removed from one of the databases. Bound differential privacy assumes the two databases have the same number of rows, but the data in one row has been changed. In practice the difference between bound and unbound differential privacy doesn’t often matter.

Quantifying difference

We said it “makes little difference” whether an individual’s data is included or not. How to we quantify that statement? Differential privacy is a stochastic procedure, so we need to measure difference in terms of probability.

Whenever mathematicians want to speak of two things being close together, we often use ε as our symbol. While in principle ε could be large, the choice of symbol implies that you should think of it as a quantity you can make as small as you like (though there may be some cost that increases as ε decreases).

We’re going to look at ratios of probabilities, so we want these ratios to be near 1. This means we’ll work with eε = exp(ε) rather than ε itself. A convenient property that falls out of the Taylor series for the exponential function is that if ε is small then exp(ε) is approximate 1+ε. For example, exp(0.05) = 1.0513, and so if the ratio of two probabilities is bounded by exp(0.05), the probabilities are roughly within 5% of each other.

Definition of differential privacy

Now we can finally state our definition. An algorithm A satisfies ε-differential privacy if for every t in the range of A, and for every pair of neighboring databases D and D‘,

\frac{\mbox{Pr}(A(D) = t)}{\mbox{Pr}(A(D') = t)} \leq \exp(\varepsilon)

Here it is understood that 0/0 = 1, i.e. if an outcome has zero probability under both databases, differential privacy holds.

Related posts

Logic and applications Twitter account

I stopped posting to the @FormalFact Twitter account last July, but I didn’t deactivate the account. Now I’m going to restart it.

Unlike my other Twitter accounts, I don’t plan to have a regular posting schedule. I may not post often. We’ll see how it goes.

LogicPractice icon

I’ve changed the account name from @FormalFact to @LogicPractice. The “formal” part of the original name referred to formal theorem proving, the initial focus of the account. The new name reflects a focus on logic more generally, and practical applications of logic that are less laborious than formal theorem proving.

Continued fraction cryptography

Every rational number can be expanded into a continued fraction with positive integer coefficients. And the process can be reversed: given a sequence of positive integers, you can make them the coefficients in a continued fraction and reduce it to a simple fraction.

In 1954, Arthur Porges published a one-page article pointing out that continued fractions could be used to create a cipher. To encrypt your cleartext, convert it to a list of integers, use them as continued fraction coefficients, and report the resulting fraction. To decrypt, expand the fraction into a continued fraction and convert the coefficients back to text.

We can implement this in Mathematica as follows:

encode[text_] := FromContinuedFraction[ ToCharacterCode[ text ]]
decode[frac_] := FromCharacterCode[ ContinuedFraction[ frac ]]

So, for example, suppose we want to encrypt “adobe.” If we convert each letter to its ASCII code we get {97, 100, 111, 98, 101}. When we make these numbers coefficients in a continued fraction we get


which reduces to 10661292093 / 109898899.

This isn’t a secure encryption scheme by any means, but it’s a cute one. It’s more of an encoding scheme, a (non-secret) way to convert a string of numbers into a pair of numbers.

Related posts

[1] Arthur Porges. A Continued Fraction Cipher. The American Mathematical Monthly, Vol. 59, No. 4 (Apr., 1952), p. 236

Earth mover distance and t-closeness

airport crowd

There’s an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it’s like a lot of other records.


The idea of k-anonymity is that every database record appears at least k times when you restrict your attention to quasi-identifiers [1]. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity. Even when k is large, k-anonymity might prevent re-identification but still suffer from attribute disclosure.

Another problem with k-anonymity is that it doesn’t offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.

Or going the other way around, if you know already know something that stands about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won’t go into l-diversity because it’s an intermediate step to where we want to go, which is t-closeness.


The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The “t” comes from requiring that the distributions be no more than a distance t apart in a sense that we’ll define below [2]. If the sensitive data in a group doesn’t stand out, this thwarts the homogeneity attack and the background knowledge attack.

Earth mover distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what “far” means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Liebler divergence, though that’s not what we’ll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also know as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5 respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.

Related posts

[1] What exactly is a quasi-identifier? That’s hard to say precisely. An identifier is something that clearly identifies an individual. A phone number, for example, is an identifier. A quasi-identifier is something that helps narrow down who a person is, but isn’t an identifier. For example, someone’s birthday would be a quasi-identifier. But in general “quasi-identifier” is not a precisely defined concept.

[2] It’s common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn’t descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database tables, and the name stuck. Similarly the l in l-diversity counts something.

As a sidenote, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Integration by long division

Since integration is the inverse of differentiation, you can think of integration as “dividing” by d.

J. P. Ballantine [1] shows that you can formally divide by d and get the correct integral. For example, he arrives at

\int x^2 \sin x \, dx = (2 - x^2 )\cos x + 2x \sin x + C

using long division!

[1] J. P. Ballantine. Integration by Long Division. The American Mathematical Monthly, Vol. 58, No. 2 (Feb., 1951), pp. 104-105

Biometric security and hypothesis testing

A few weeks ago I wrote about how there are many ways to summarize the operating characteristics of a test. The most basic terms are accuracy, precision, and recall, but there are many others. Nobody uses all of them. Each application area has their own jargon.

Biometric security has its own lingo, and it doesn’t match any of the terms in the list I gave before.

facial scan authentication

False Acceptance Rate

Biometric security uses False Acceptance Rate (FAR) for the proportion of times a system grants access to an unauthorized person. In statistical terms, FAR is Type II error. Also known as False Match Rate (FRM).

False Rejection Rate

False Rejection Rate (FRR) is the proportion of times a biometric system fails to grant access to an authorized person. In statistical terms, FRR is Type I error. FAR is also known as False Non Match Rate (FNMR).

Crossover Error Rate

One way to summarize the operating characteristics of a biometric security system is to look at the Crossover Error Rate (CER), also known as the Equal Error Rate (EER). The system has parameters that can be tuned to adjust the FAR and FRR. Adjust these to the point where the FAR and FRR are equal. When the two are equal, their common value is the CER or EER.

The CER gives a way to compare systems. The smaller the CER the better. A smaller CER value means it’s possible to tune the system so that both the Type I and Type II error rates are smaller than they would be for another system.

Loss Function

CER is kind of a strange metric. Everyone agrees that you wouldn’t want to calibrate a system so that FAR = FRR. In security applications, FAR (unauthorized access) is worse than FRR (authorized user locked out). The former could be a disaster while the latter is an inconvenience. Of course there could be a context where the consequences of FAR and FRR are equal, or that FRR is worse, but that’s not usually the case.

A better approach would be to specify a loss function (or its negative, a utility function). If unauthorized access is K times more costly than locking out an authorized user, then you might want to know at what point K * FAR = FRR or your minimum expected loss [1] over the range of tuning parameters. The better system for you, in your application, is the one corresponding to your value of K.

Since everyone has a different value of K, it is easier to just use K = 1, even though everyone’s value of K is likely to be much larger than 1. Unfortunately this often happens in decision theory. When people can’t agree on a realistic loss function, they standardize on a mathematically convenient implicit loss function that nobody would consciously choose.

If everyone had different values of K near 1, the CER metric might be robust, i.e. it might often make the right comparison between two different systems even though the criteria is wrong. But since K is probably much larger than 1 for most people, it’s questionable that CER would rank two systems the same way people would if they could use their own value of K.


[1] These are not the same thing. To compute expected loss you’d need to take into account the frequency of friendly and unfriendly access attempts. In a physically secure location, friends may attempt to log in much more often than foes. On a public web site the opposite is more likely to be true.

Modal and temporal logic for computer security


In the previous post, I mentioned that modal logic has a lot of interpretations and a lot of axiom systems. It can also have a lot of operators. This post will look at Security Logic, a modal logic for security applications based on a seminal paper by Glasgow et al [1]. It illustrates how modal and temporal logic can be applied to computer security, and it illustrates how a logic system can have a large number of operators and axioms.

Knowledge axioms

Security Logic starts with operators Ki that extend the box operator □. For a proposition p, Ki p is interpreted as saying that the ith agent knows p to be true. Glasgow et al chose the following axioms for inference regarding Ki.

Axiom K1:   Ki φ → φ
Axiom K2:   Ki(φ → ψ) → (Ki φ → Ki ψ)
Axiom K3:   Ki φ → Ki Ki φ
Axiom K4:   ¬ Ki φ → Ki( ¬ Ki φ)

These can be interpreted as follows.

If you know something to be true, it’s true.
If you know one thing implies another, then knowing the first lets you know the other.
If you know something, you know that you know it.
If you don’t know something, you know that you don’t know it.

This is not to claim that these propositions are universal metaphysical truths, only modeling assumptions. It may be possible, for example, to know something without knowing that you know it, but that’s a subtle matter excluded from this model.

Temporal logic axioms

Temporal logic is modal logic with operators interpreted as applying to propositions over time. Glasgow et al introduce three temporal operators: ∀□, ∀◇, and  ∃□. Note that each of these is a single operator; there is no box or diamond operator in Security Logic.

∀□p means that p always holds, ∀◇p means that p eventually holds, and ∃□p means that p sometimes holds.

The ∃□ and ∀□ operators are dual in the same sense that the basic modal operators □ and ◇ are dual:

∃□p ↔ ¬ ∀□ ¬p
∀□p ↔ ¬ ∃□ ¬p

Security Logic not give axioms for ∃□ because its behavior is determined by the axioms for ∀□.

The three temporal logic axioms for Security Logic are as follows.

Axiom T1:  φ → ∀◇φ
Axiom T2:  ∀□(φ → ψ) → (∀□φ → ∀□ψ)
Axiom T3:  ∀□φ → ∀□ ∀□φ

These can be interpreted as follows.

If a proposition holds, it eventually holds.
If its always the case that one proposition implies another, then if the former always holds then the latter always holds.
If something always holds, then it always always holds.

Obligation and Permission

Finally, Security Logic introduces modal operators O and P for obligation and permission.

Obligation and permission are dual, just as necessity and possibility are dual and the always and sometimes operators are dual.

Op ↔ ¬ P ¬p
Pp ↔ ¬ O ¬p

When obligation and permission are combined with knowledge, Security Logic introduces the following abbreviations.

Oi = OKi
Pi = PKi

Axioms for Security Logic

The complete axioms for Security Logic are the four knowledge axioms above, the three temporal axioms above, and the five additional axioms below.

Axiom SL1:   Pi φ for all propositional tautologies φ.
Axiom SL2:  Pi φ → φ
Axiom SL3:  (Pi φ ∧ Pi ψ) ↔ Pi (φ ∧ ψ)
Axiom SL4:  Pi φ → Pi (φ ∨ ψ)
Axiom SL5:  Oi φ → Pi φ

These can be interpreted as follows.

You are permitted to know all tautologies.
Whatever is permitted must be true.
Permission holds through conjunction and disjunction.
Whatever is obligatory must be permissible.

Related posts

[1] Janice Glasgow, Glenn MacEwen, and Prakash Panangaden. A Logic for Reasoning About Security. ACM Transactions on Computer Systems, Vol 10 No 3, August 1992, pp 226–264.