Uncategorized

A very accurate logarithm approximation

The previous post looked at an efficient way to approximate nth roots of fractions near 1 by hand. This post does the same for logarithms.

As before, we assume x = p/q and define

s = p + q
d = pq

Because we’re interested in values of x near 1, d is small, and small numbers are convenient to work with by hand.

In [1] Kellogg gives the approximation

log x ≈ 3(x² − 1)/((x+ 1)² + 2x) = 6ds/(3s² − d²)

So, for example, suppose we wanted to take the natural log of 7/8. then p = 7, q = 8, s = 15, and d = −1.

log x ≈ (6×15×(−1))/(3×225 − 1) = − 90/674 = − 45/337.

This approximation is good to six decimal places.

Kellogg claims that

This value of E [the natural logarithm], if q [what I’ve called x] be between .9 and 1.1, is true to the seventh decimal.

He then goes on to explain how to create an even more accurate approximation, and how to deal with larger values of x.

Here’s a plot verifying Kellogg’s claim.

Note the that scale of the plot is 10−8. As the flat spot in the middle suggests, you get even more decimal places for x closer to 1.

[1] Ansel N. Kellogg. Empirical formulæ; for Approximate Computation. The American Mathematical Monthly. February 1987, Vol. 4 No. 2, pp. 39–49.

 

Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 217, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 27, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters where extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a non-ASCII character in UTF-8 are 11. But 9 is 1001 in binary, i.e. it starts with 10, and so the byte 0x92 is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t the intended encoding, you may need to do some detective work.

Related posts

Additive functions

A function f from positive integers to real numbers is defined to be additive if for relatively prime numbers m and n,

f(mn) = f(m) + f(n).

The function f is called completely addititive if the above holds for all positive integers m and n, i.e. we drop the requirement that m and n are relatively prime.

Example: total prime factors

One example of an additive function is the function Ω(n) defined to be the number of prime factors of n, counted with multiplicity. For example, Ω(12) = 3 because 12 = 2 × 2 × 3. The numbers 10 and 63 are relatively prime, and

Ω(630) = 5 = Ω(10) + Ω(63).

Example: distinct prime factors

Another example of an additive function is ω(n) defined to be the number of distinct prime factors of n, i.e. not counting with multiplicity. So, for example, ω(12) = 2.

This function is additive but not completely additive because, for example,

ω(20) = 2 ≠ ω(2) + ω(10)  = 3

A theorem of Erdős

Here is a remarkable theorem due to Paul Erdős [1]. Suppose f is an additive function such that

f(n + 1) − f(n)

converges to zero as n goes to infinity. Then

f(n) = c log(n)

for some constant c. And since a multiple of a logarithm is a logarithm to a different base, we can restate the conclusion by simply saying f is a logarithm.

Logarithms are completely additive functions, so even though we only assumed f was additive, this combined with the limit condition proves that in fact f is completely additive.

Related posts

[1] Paul Erdős, “On the distribution function of additive functions,” Ann. of Math., Vol. 47 (1946), pp. 1–20.

Security by obscurity

Security-by-obscurity is a bad idea in general. It’s better, for example, to have a login page than to give your site an obscure URL. It’s better to encrypt a file than to hide it in some odd directory. It’s better to use a well-vetted encryption algorithm than to roll your own.

There are people whose knee-jerk reaction to any form of obscurity is to shout “That’s security-by-obscurity!” but obscurity can be subtle.

All else being equal, adding a layer of obscurity doesn’t hurt. For example, you can literally make a public encryption key public, as I’ve done here. But for extra security, why distribute your encryption key more widely than necessary? And if your message is adequately encrypted, you could in principle publish it for the world to see. But why not just give it to the intended recipient?

The public key on my site is there for strangers to contact me, but if I were really concerned about secure communication between colleagues, I’d just circulate the key among those colleagues. That may not be much more secure, but surely it’s no less secure. And I’d share messages privately, even though they are encrypted.

It’s good to look closely at any argument that beings “all else being equal” to see if all else is indeed equal. A more nuanced objection to security-by-obscurity is that it can create a false sense of security.

One could argue, for example, that making your public key available to the world forces you to be more careful about your encryption. Maybe you’ve been using an RSA key for years, and you really should use a longer key, but you don’t because you can argue that not many people have your public key anyway. But if your key’s too sort, obscuring your public key doesn’t help.

And while it’s better to deliver encrypted messages privately, it helps to not count on this, to assume that the encrypted message might be made public. That’s the basic premise behind encryption.

The principle behind no-security-by-obscurity is that you want to concentrate your security where it can be quantified. You can, for example, quantify how much more effort it would take to break a 64-bit key (like Blowfish) than a 56-bit key (like DES). Or even better, a 128-bit key (like AES). But you can’t quantify the level of protection that comes from obscurity.

Is it more secure to give someone a 56-bit DES key on a flash drive in a dark alley than to send them a 64-bit Blowfish key over SMS You can’t calculate an answer to that question.

In some sense all security is by obscurity. Cryptography literally means hidden writing. But all else being equal—there’s that phrase again—you want to minimize the surface area of what you have to obscure, e.g. limiting your secret to your key and not your algorithm, and it’s better to have quantified risks than unquantified risks. But all else is often not equal, and there are difficult trade-offs.

Related posts

How much metadata is in a photo?

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.

You can look up that ground level at that location is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)

When I cropped the image, the edited image contained the software and operating system that was used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.

More metadata posts

Related posts

Your PDF may reveal more than you intend

When you create a PDF file, what you see is not all you get. There is metadata embedded in the file that might be useful. It also might reveal information you’d rather not reveal.

The previous post looked at just the time stamp on a file. This post will look at more metadata, focusing on privacy implications.

Inspecting metadata

Here’s a little Python script we’ll use to inspect some of the metadata in a PDF. I say some because this does not pick out everything in every PDF.

    from pypdf import PdfReader

    def print_metadata(filename):
        print("File: ", filename, "\n")    
        reader = PdfReader(filename)
        meta = reader.metadata
        for m in meta:
            print(m, meta[m])

Let’s run this on the “Hello world” example from the previous post.

    File:  humpty.pdf

    /Creator Writer
    /Producer LibreOffice 7.5
    /CreationDate D:20240208064322-06'00'

OK, so this shows that the file was created with LibreOffice Writer, version 7.5.

Time and location

It also shows when the file was written. As I discussed in the previous post, the file was written today at 6:43:22. But what I didn’t comment on before was the -6'00' at the end. This is my time zone, six hours behind GMT, i.e. US Central Standard Time.

Note that the time zone isn’t just time information, it’s also location information. It’s no secret that I live in Houston, but if I didn’t want to reveal my location, this time stamp would partially give away where I live. (Probably. Strictly speaking it reveals the time zone setting on my computer.)

Microsoft Word files

I repeated my “Hello world” file experiment with Microsoft Word on an old laptop. When I exported to PDF I got the following.

    /Author John Cook
    /Creator Microsoft® Word 2016
    /CreationDate D:20240208101055-06'00'
    /ModDate D:20240208101055-06'00'
    /Producer Microsoft® Word 2016

So this includes my name. The installation program for Microsoft Office asks for your name, and I must have provided it. Either LibreOffice doesn’t ask or I didn’t enter it.

When I print to PDF rather than export to PDF I get slightly different output.

    /Author John
    /CreationDate D:20240208101220-06'00'
    /ModDate D:20240208101220-06'00'
    /Producer Microsoft: Print To PDF
    /Title Microsoft Word - Document1

LaTeX files

Now let’s look at a PDF created from a LaTeX file. I created a file foo.tex with the following content

    \documentclass{article}
    \begin{document}
    Hello world.
    \end{document}

then compiled it with pdflatex foo.tex. Let’s see what metadata our Python code can find.

    /Producer pdfTeX-1.40.25
    /Creator TeX
    /CreationDate D:20240208075059-06'00'
    /ModDate D:20240208075059-06'00'
    /Trapped /False
    /PTEX.Fullbanner This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/MacPorts 2023.66589_1) kpathsea version 6.3.5

Obviously the file was created with TeX [1]. You can usually identify TeX files by their appearance. You can make a TeX file look less distinctive by changing the default font and a few other things. But if you did so without changing the metadata, someone could still determine that the file was made using TeX.

I’m not trying to conceal that I use LaTeX. But if you create a PDF with an obscure program, maybe that reveals more than you’d like to reveal.

Operating system

You can see that the file was produced on a Mac. When I compiled the same file on my Linux desktop, it showed the operating system as Debian but was not any more specific.

When you see that a file was created using Microsoft Word, it was probably created on Windows. I don’t have Word on my Mac, but I wouldn’t be surprised if the application was reported to be something like Office for MacOS rather than just Word.

I created a document with Microsoft 365 online and it reported the following.

    /Author John Cook
    /Creator Microsoft Word
    /CreationDate D:20240208084209-08'00'
    /ModDate D:20240208084209-08'00'

The lack of an operating system in the Creator field may indicate that the document was created online. Note that the time zone is −8, i.e. Pacific Standard Time. This isn’t my time zone but the time zone of the server, perhaps in Seattle.

Related posts

[1] LaTeX is written on top of TeX. The metadata says the file was created with TeX, because ultimately it really was.

Bad takes on chaos theory

I just finished reading The Three Body Problem. At the end of the book is a preview of Cixin Liu’s book Supernova Era. A bit of dialog in that preview stood out to me because it is touches on themes I’ve written about before.

“I’ve heard about that. When a butterfly flaps its wings, there’s a hurricane on the other side of the world.”

“That’s right,” Specs said, nodding. A chaotic system.”

Huahua said, “I want to be that butterfly.”

Specs should his head again. “You don’t understand at all. We’re all butterflies, just like every butterfly. Every grain of sand and every drop of rain is a butterfly. That’s why the world is unpredictable.”

Most popular interpretations of chaos theory are misguided. Two such misguided interpretations are illustrated in the passage above.

When hearing of chaos theory, many jump to the same conclusion as the Huahua in the excerpt above who wants to be the butterfly that starts a hurricane. They think chaos theory implies that butterfly effects can be engineered. This is the optimistic fallacy.

Sometimes a small deliberate effort can lead to a large intended conclusion. But chaos theory would not predict this. In fact, reasoning by analogy from chaos theory would suggest this is impossible. More on that here and here.

Another misguided interpretation of chaos theory is the pessimistic fallacy that “the world is unpredictable,” as Specs says above. But we know that’s not true. Some aspects of the world are very predictable. As Orphan Annie says, the sun will come out tomorrow.

Even people who say the world is unpredictable don’t live as if the world were unpredictable. Deep down they know full well that in important ways the world is predictable.

Bell curve meme.

It’s true that not everything is as predictable as we may have imagined, weather being a famous example. Chaos theory was born out of the surprising observation that weather simulations are very sensitive to changes in initial conditions.

We do not live in a world in which we can tickle a particular butterfly in order to deliberately direct the course of the future. But neither do we live in a world without discernible causes and effects.

Straddling checkerboard encryption

Introduction

Computers fundamentally changed cryptography, opening up new possibilities for making and breaking codes. At first it may not have been clear which side benefited most, but now it’s clear that computers gave more power to code makers than code breakers.

We now have cryptographic primitives that cannot be attacked more efficiently than by brute force, as far as we know. The weak link is how these primitives are implemented and combined, not the primitives themselves.

Before computers there was more of a cat and mouse game between encryption and cryptanalysis. Encryption schemes that were convenient to carry out by hand could usually be broken by hand eventually. But if you only needed secrecy briefly, a simple scheme might provide that secrecy for long enough. This post will look at one such scheme, the straddling checkerboard.

Checkerboards

Perhaps the most obvious way to conveniently turn letters into numbers is to arrange the letters into a 5 × 5 grid. This has to leave out one letter, and in practice this meant combining I and J. Or if you needed digits, you could use a 6 × 6 grid and put J back in. You’d scramble the alphabet in the grid according to some key, then encrypt each letter by its coordinates.

       12345
      +-----
     1|EBISP
     2|XWLVN
     3|AOYZQ
     4|MDCKH
     5|RTUFG

This is no better than a simple substitution cipher because someone intercepting a message encrypted this way would easily guess that pairs of digits represent letters. However, if you then permuted the digits with a transposition cipher, you’d have something more formidable. This is essentially what the ADFGV cipher did, which stumped cryptanalysts for a while.

The straddling checkerboard is a variation on the method above. Letters would be arranged in a 3 × 10 grid rather than 5 × 5. Some letters would be encrypted as a single digit and some as a pair of digits.

       1234567890
      +----------
      |  EBISPXWL
     1|VNAOYZQMDC
     2|KHRTUFGJ./

In the example above, E would be encrypted as 3, N would be encrypted as 12, and so on. This is an instance of a prefix code. In order to be able to decode the digits unambiguously, no letter could be encoded as 1 or 2; these digits always signaled the beginning of a pair.

Prefix codes are often used in non-secret codes, such as country codes for telephone numbers. More examples of prefix codes in this post.

Because 1 and 2 could not be used to encode single letters, there were 28 slots to fill. These could be filled with other symbols, and in practice period and slash were added [1].

Efficiency

The straddling checkerboard gives a more efficient encoding than does the checkerboard since typically fewer digits will be required. If efficiency were the only concern, we’d put the eight most frequent letters on the top row, something like the following [2].

       1234567890
      +----------
      |  ETAOINSR
     1|BCDFGHJKLM
     2|PQUVWXYZ./

This would be more efficient but less secure since the arrangement of the letters would be more predictable.

Security

The straddling checkerboard presents a bit of a challenge to the cryptanalyst since it’s not know a priori whether a digit is part of a pair (if the vertical coordinates are not always 1 and 2).

The straddling checkerboard didn’t offer much security even in its day. It would have been better if there had been some further processing done on the digits, such as how the ADFGV cipher permuted its coordinates.

The message, interpreted as a number N, could have been further encrypted as aN + b where a and b were randomly chosen numbers that were part of the key. As far as I know, nothing like this was ever done. This would have provided more security but would also require more effort and increase the chance of introducing errors.

Related posts

[1] David Kahn. The Codebreakers. Chapter 18.

[2] You may have expected the last letter on the first row to be H, going by the printer’s order ETAOIN SHRDLU. Peter Norvig discovered a slightly different order of letter frequencies based on the Google corpus.

Email subscription changes

I will soon be discontinuing the email subscription option for this blog. I recommend that email subscribers switch over to subscribing to the RSS feed for the blog. If you’re unfamiliar with RSS, here is an article on how to get started.

(I recommend RSS in general, and not just for subscribing to this blog. RSS lets you decide what sources you want to receive rather than leave the choice up to someone else’s algorithm.)

Update: I have replaced my email subscription service with a Substack account.

You can also find me on Twitter and on Mastodon at johndcook@mathstodon.xyz.