Test whether a large integer is a square

Years ago I wrote about a fast way to test whether an integer n is a square. The algorithm rules out a few numbers that cannot be squares based on their last (hexadecimal) digit. If the integer passes this initial screening, the algorithm takes the square root of n as a floating point number, rounds it to an integer, and tests whether the square of the result equals n.

What’s wrong with the old algorithm

The algorithm described above works fine if n is not too large. However, it implicitly assumes that the integer can be represented exactly as a floating point number. This is two assumptions: (1) it’s not too large to represent, and (2) the representation is exact.

A standard 64-bit floating point number has 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand. (More on that here.) Numbers will overflow if they run out of exponent bits, and they’ll lose precision if they run out of significand bits. So, for example, 2^1024 will overflow [1]. And although 2^53 + 1 will not overflow, it cannot be represented exactly.

These are illustrated in Python below.

    >>> float(2**1024)
    OverflowError: int too large to convert to float
    >>> n = 2**53
    >>> float(n) == float(n + 1)
    True

A better algorithm

The following algorithm [2] uses only integer operations, and so it will work exactly for arbitrarily large integers in Python. It’s a discrete analog of Newton’s square root algorithm.

    def intsqrt(N):
        # Newton's method in integer arithmetic, starting above the root
        n = N
        while n**2 > N:
            n = (n + N//n) // 2
        return n  # floor of the square root of N

    def issquare(N):
        n = intsqrt(N)
        return n**2 == N

The function issquare tests whether N is a square. The function intsqrt is broken out as a separate function because it is independently useful: it returns ⌊√N⌋.

The code above correctly handles examples that the original algorithm could not.

    >>> issquare(2**1024)
    True
    >>> issquare(2**54)
    True
    >>> issquare(2**54 + 1)
    False

You could speed up issquare by quickly returning False for numbers that cannot be squares on account of their final digits (in some base; you might want to use a base larger than 16) as in the original post.
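For example, here is a minimal sketch of such a screen in base 16 (the name issquare_fast is mine, not from the original post). It relies on the fact that a square can only be congruent to 0, 1, 4, or 9 mod 16, so most non-squares are rejected before any integer square root is computed.

    def issquare_fast(N):
        # Squares can only end in 0, 1, 4, or 9 in hexadecimal.
        if N % 16 not in (0, 1, 4, 9):
            return False
        n = intsqrt(N)
        return n**2 == N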

[1] The largest allowed exponent is 1023 for reasons explained here.

[2] Bob Johnson. A Route to Roots. The Mathematical Gazette, Vol. 74, No. 469 (Oct., 1990), pp. 284–285

Why use hash puzzles for proof-of-work?

A couple days ago I wrote about the problem that Bitcoin requires to be solved as proof of work. In a nutshell, you need to tweak a block of transactions until the SHA256 double hash of its header is below a target value [1]. Not all cryptocurrencies use proof of work, but those that do mostly use hash-based puzzles.

Other cryptocurrencies use a different hashing problem, but they still use hashes. Litecoin and Dogecoin use the same proof-of-work problem, similar to the one Bitcoin uses, but with the scrypt (pronounced S-crypt) hash function. Several cryptocurrencies use a hashing problem based on Equihash. Monero uses its RandomX algorithm for proof-of-work, and although this algorithm has multiple components, it ultimately solves a hashing problem. [2]

Why hash puzzles?

Why do cryptocurrencies use hashing problems for proof of work? In principle they could use any computational problem that is hard to solve but easy to verify, such as numerous problems in computational number theory.

One reason is that computer scientists are confident that quantum computing would not reduce the difficulty of solving hash puzzles, even though it would reduce the difficulty of factoring-based puzzles. Also, there is general agreement that it’s unlikely a mathematical breakthrough will find a weakness in hashing functions.

Ian Cassels said “Cryptography is a mixture of mathematics and muddle, and without the muddle the mathematics can be used against you.” Hashing is much more muddle than mathematics.

Why not do something useful?

Hash puzzles work well for demonstrating work done, but they’re otherwise useless. They keep the wheels of cryptocurrencies turning, but the solutions themselves are intrinsically worthless.

Wouldn’t it be nice if crypto miners were solving useful problems like protein folding? You could do that. In fact there is a cryptocurrency FoldingCoin that does just that. But FoldingCoin has a market cap seven orders of magnitude smaller than Bitcoin, on the order of $200,000 compared to Bitcoin’s market cap of $2T.

Cryptocurrencies that use proof of useful work have not taken off, and that may be inevitable. Requiring practical work creates divergent incentives. If you base a currency on the difficulty of protein folding computations, for example, it would cause major disruption if a pharmaceutical company decided to get into mining protein folding cryptocurrency at a loss because it values the results.

Going back to Cassels’ remark about mathematics and muddle, practical real-world problems often have a mathematical structure. Which is a huge blessing, except when you’re designing problems to be hard. Hash-based problems have gradually become easier to solve over time, and cryptocurrencies have adjusted. But a mathematical breakthrough for solving a practical problem would have the potential to disrupt a currency faster than the market could adapt.

[1] You don’t change the transaction amounts, but you may change the order in which the transactions are arranged into a Merkle tree so that you get different hash values. You can also change a 32-bit nonce and a timestamp, but most of the degrees of freedom you need in order to find an acceptable hash come from rearranging the tree.

[2] Both scrypt and Equihash were designed to be memory intensive and to thwart the advantage of custom ASIC mining hardware. However, people have found ways to use ASIC hardware to solve scrypt and Equihash problems. RandomX requires running a randomly generated program before hashing the output in an attempt to frustrate efforts to develop specialized mining hardware.

What is the Bitcoin proof-of-work problem?

In order to prevent fraud, anyone wanting to add a block to the Bitcoin blockchain must prove that they’ve put in a certain amount of computational work. This post will focus on what problem must be solved in order to produce proof of work.

You’ll see the proof of work function described as finding strings whose SHA256 hash value begins with a specified number of 0s. That’s sort of a zeroth-level approximation of the problem.

The string s you’re trying to find has the form data + nonce where the data comes from the block you’re wanting to add and the nonce is a value you concatenate on the end. You try different values until you get an acceptable hash value.

You’re not computing SHA256(s) but rather the double hash:

SHA256²(s) = SHA256(SHA256(s))

The only way to find such a string s is by brute force [1], and applying the hash function twice doubles the amount of brute force work needed.

And you’re not exactly trying to produce leading zeros; you’re trying to produce a value less than a target T. This is roughly the same thing, but not quite.

To illustrate this, suppose you have a 2FA fob that generates six-digit random numbers. You’ve been asked to wait until your fob generates a number less than 2025. Waiting until you have three leading zeros would be sufficient, but that would be making the task harder than it needs to be. You’d be waiting for a number less than 1000 when you’re only asked to wait for a number less than 2025.

A SHA256 hash value is a 256-bit number. If your target T is a power of 2

T = 2^(256−n)

then finding a value of s such that

SHA256²(s) < T

really is finding an s whose (double) hash begins with n zeros, though T is not required to be a power of 2.

Finding a value of s with

SHA256²(s) < 2^(256−n)

will require, on average, testing 2^n values of s.
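To make the target comparison concrete, here is a toy Python sketch. The header string, the 4-byte nonce encoding, and the 16-bit difficulty are made up for illustration; this is not the real Bitcoin header layout, just the same search idea with a much easier target.

    import hashlib

    def double_sha256(data):
        # SHA256(SHA256(data)), read as a 256-bit integer
        h = hashlib.sha256(hashlib.sha256(data).digest()).digest()
        return int.from_bytes(h, "big")

    T = 2**(256 - 16)            # expect to test about 2**16 nonces
    data = b"toy block header"

    nonce = 0
    while double_sha256(data + nonce.to_bytes(4, "little")) >= T:
        nonce += 1

    print(nonce, hex(double_sha256(data + nonce.to_bytes(4, "little"))))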

The value of T is adjusted over time in order to keep the amount of necessary work roughly constant. As miners have become more efficient, the value of T has gone down and the amount of work has gone up. But the value of T can go up as well. It is currently fluctuating around 2^176, i.e. hashes must have around 80 leading zero bits.

Now here’s where things get a little more complicated. I said at the beginning that the string s has the form

s = data + nonce

where the data comes from the block you’re trying to add and the nonce is a number you twiddle in order to get the desired hash value. But the nonce is a 32-bit integer. If you need to hash on the order of 2^80 strings in order to find one with 80 leading zeros, you can’t do that just by adjusting a 32-bit nonce.

In practice you’re going to have to twiddle the contents of what I’ve called data. The data contains a Merkle tree of transactions, and you can change the hash values by adjusting the order in which transactions are arranged in the tree, in addition to adjusting the nonce.

[1] Unless someone finds a flaw in SHA256, which cryptographers have been trying to do for years and have not been able to do. And if a significant weakness is found in SHA256, it may not translate into a significant flaw in SHA256².

Naively computing sine

Suppose you need to write software to compute the sine function. You were told in a calculus class that this is done using Taylor series—it’s not, but that’s another story—and so you start writing code to implement Taylor series.

How many terms do you need? Uh, …, 20? Let’s go with that.

from math import sin, factorial

def sine(x):
    s = 0
    N = 20
    for n in range(N//2):
        m = 2*n + 1
        s += (-1)**n * x**m / factorial(m)
    return s

You test your code against the built-in sin function and things look good.

    >>> sin(1)
    0.8414709848078965
    >>> sine(1)
    0.8414709848078965

Larger arguments

Next, just for grins, you try x = 100.

    >>> sin(100)
    -0.5063656411097588
    >>> sine(100)
    -7.94697857233433e+20

Hmm. Off by 20 orders of magnitude.

So you do a little pencil and paper calculation and determine that the code should work if you set N = 300 in the code above. You try again, and now you’re off by 25 orders of magnitude. And when you try entering 100.0 rather than 100, the program crashes.

What’s going on? In theory, setting N = 300 should work, according to the remainder term in the Taylor series for sine. But the largest representable floating point number is on the order of 10^308 and so x**m terms in the code overflow. The reason the program didn’t crash when you typed in 100 with no decimal point is that Python executed x**m in integer arithmetic.

Even if you were patient and willing to use a huge number of terms in your Taylor series, you couldn’t include enough terms to get accurate results for even moderately large arguments: overflow would kill you before you could get the series truncation error down. Unless you used extended precision floating point. But that would be using heroic measures to rescue a doomed approach.

Taylor series are good local approximations. They’re only valid within a radius of convergence; outside of that radius they’re worthless even in theory. Inside the radius they may be valid in theory but not in practice, as is the case above.

Range reduction

Taylor series isn’t the best approach for evaluating sine numerically, but you could make it work if you limited yourself to a small range of angles, say 0 to π/4, and used trig identities to reduce the general problem to computing sine in the range [0, π/4]. So how do you do that?

Let’s look at reducing arguments to live in the interval [0, 2π]. In practice you’d want to reduce the range further, but that requires keeping up with what quadrant you’re in etc. We’ll just consider [0, 2π] for simplicity.

The natural thing to do is subtract off the largest multiple of 2π you can.

    >>> from math import pi
    >>> k = int(100/(2*pi))
    >>> x = 100 - k*2*pi
    >>> sine(x)
    -0.5065317636696883
    >>> sin(100)
    -0.5063656411097588

That’s better. Now we don’t overflow, and we get three correct decimal places. We could increase N and get more. However, there’s a limit to how well we could do, as the calculation below shows.

    >>> sin(x)
    -0.5063656411097495
    >>> sin(100)
    -0.5063656411097588

When we reduce 100 mod 2π and call sin(x), we get a result that differs from sin(100) in the last three decimal places. That’s not terrible, depending on the application, but the loss of precision increases with the size of x and would be a complete loss for large enough values of x.

Let’s look back at where we divided 100 by 2π:

    >>> 100/(2*pi)
    15.915494309189533

When we throw away the integer part (15) and keep the rest (.915494309189533) we lose two figures of precision. The larger the integer part, the more precision we lose.

Our naive attempt at range reduction doesn’t work so well, but there are clever ways to make range reduction work. It is possible, for example, to compute the sine of a googol (i.e. sin 10^100).
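As a quick illustration of what’s possible, the sketch below does the reduction in high precision using the third-party mpmath library; this is not how production math libraries implement range reduction, but it shows that nothing is lost if you carry enough digits to absorb the enormous integer part.

    from mpmath import mp, mpf, pi, sin, floor

    mp.dps = 130                      # plenty of digits for a 100-digit integer part
    x = mpf(10)**100                  # a googol
    r = x - floor(x/(2*pi)) * 2*pi    # reduce x mod 2*pi in high precision
    print(sin(r))

Of course mpmath could evaluate sin(x) directly at this precision; the explicit reduction is spelled out only to mirror the discussion above.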

At first it seems simple to reduce the range. Then when you become aware of the precision problems and think about it a while, it seems precise range reduction is impossible. But if you keep thinking about it long enough you might find a way to make it work. A lot of effort went into range reduction algorithms years ago, and now they’re part of the computational infrastructure that we take for granted.

Computing the Euler-Mascheroni Constant

The Euler-Mascheroni constant is defined as the limit

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} - \log n \right)

So an obvious way to try to calculate γ would be to evaluate the right-hand side above for large n. This turns out to not be a very good approach. Convergence is slow and rounding error accumulates.

A much better approach is to compute

\gamma = \lim_{n\to\infty} \left(\sum_{k=1}^n \frac{1}{k} - \log\left( n + \tfrac{1}{2}\right) \right)

It’s not obvious that you can add the extra ½ inside the log term, but you can: both expressions above converge to γ, and the latter converges faster.

Here’s a comparison of the two methods: the error in the first method with n = 4000 is comparable to the error in the second method with n = 16.
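You can check this claim with a few lines of Python (the value of γ below is hard-coded to double precision for reference):

    from math import log

    gamma = 0.5772156649015329

    def approx1(n):
        # harmonic sum minus log n
        return sum(1/k for k in range(1, n+1)) - log(n)

    def approx2(n):
        # harmonic sum minus log(n + 1/2)
        return sum(1/k for k in range(1, n+1)) - log(n + 0.5)

    print(abs(approx1(4000) - gamma))   # error of the first method
    print(abs(approx2(16) - gamma))     # error of the second method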

Even though the second equation above is better for numerical evaluation, there are much more sophisticated approaches. And this brings us to y-cruncher, software that has been used to set numerous world records for computing various constants. A new record for computing the digits of π was set a few weeks ago using y-cruncher. From the y-cruncher documentation:

Of all the things that y-cruncher supports, the Euler-Mascheroni Constant is the slowest to compute and by far the most difficult to implement. …

The Euler-Mascheroni Constant has a special place in y-cruncher. It is one of the first constants to be supported (even before Pi) and it is among the first for which a world record was set using y-cruncher. As such, y-cruncher derives its name from the gamma symbol for the Euler-Mascheroni Constant.

The bad version of a good test

Ever since 1952, the largest known primes have all had the form 2^n − 1, with one exception in 1989. The reason the largest known primes have this form is that it is easier to test whether these numbers are prime than other numbers.

A number of the form 2^n − 1 is called a Mersenne number, denoted M_n. A Mersenne prime is a Mersenne number that is also prime. There is a theorem that says that if M_n is prime then n must be prime.

Lehmer’s theorem

Derrick Henry Lehmer proved in 1930 that M_p is prime if p is an odd prime and M_p divides s_{p−2}, where the s terms are defined recursively as follows. Define s_0 = 4 and s_n = s_{n−1}² − 2 for n > 0. This is the Lucas-Lehmer primality test for Mersenne numbers, the test alluded to above.

Let’s try this out on M_7 = 2^7 − 1 = 127. We need to test whether 127 divides s_5. The first six values of s_n are

4, 14, 194, 37634, 1416317954, 2005956546822746114

and indeed 127 does divide 2005956546822746114. As you may have noticed, s_5 is a big number, and s_6 will be a lot bigger. It doesn’t look like Lehmer’s theorem is going to be very useful.

The missing piece

Here’s the idea that makes the Lucas-Lehmer [1] test practical. We don’t need to know s_{p−2} per se; we only need to know whether it is divisible by M_p. So we can carry out all our calculations mod M_p. In our example, we only need to calculate the values of s_n mod 127:

4, 14, 67, 42, 111, 0

Much better.

The Lucas-Lehmer test takes the remainder by M_p at every step along the way when computing the recursion for the s terms. The naive version does not, but computes the s terms directly.

To make the two versions of the test explicit, here are Python implementations.

Naive code

def lucas_lehmer_naive(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = s*s - 2        # the s values grow doubly exponentially
    return s % M == 0

Good code

def lucas_lehmer(p):
    M = 2**p - 1
    s = 4
    for _ in range(p-2):
        s = (s*s - 2) % M  # reduce mod M at every step
    return s == 0
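As a quick sanity check, here is the good version run against the first few Mersenne prime exponents (a standard list; note that the test assumes p is an odd prime, so p = 2 is excluded):

mersenne_prime_exponents = [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127]
print(all(lucas_lehmer(p) for p in mersenne_prime_exponents))  # True
print(lucas_lehmer(11))  # False: 2**11 - 1 = 2047 = 23 * 89 is composite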

The bad version

How bad is the naive version of the Lucas-Lehmer test?

The OEIS sequence A003010 consists of the s_n. OEIS gives the following formula:

s_n = \left\lceil (2 + \sqrt{3})^{2^n}\right\rceil

This shows that the s_n grow extremely rapidly, which suggests the naive algorithm will hit a wall fairly quickly.

When I used the two versions of the test above to verify that the first few Mersenne primes Mp really are prime, the naive version gets stuck after p = 19. The better version immediately works through p = 9689 before slowing down noticeably.

Historical example

Let’s look at M_521. This was the first Mersenne number verified by a computer to be prime. This was done on the vacuum tube-based SWAC computer in 1952 using the smart version of the Lucas-Lehmer test.

Carrying out the Lucas-Lehmer test to show that this number is prime requires doing modular arithmetic with 521-bit numbers, which was well within the SWAC’s 9472 bits of memory.

But it would not be possible now, or ever, to verify that M_521 is prime using the naive version of the Lucas-Lehmer test because we could not store, much less compute, s_519.

Since

\log_{2} s_{519} = 2^{519} \log_{2} (2 + \sqrt{3}) \approx 2^{520}

we’d need around 2^520 bits to store s_519. The number of bits we’d need is itself a 157-digit number. We’d need more bits than there are particles in the universe, which has been estimated to be on the order of 10^80.
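The digit count is easy to verify with ordinary floating point arithmetic, since the bit count itself, roughly 10^156, is still far below the float overflow threshold:

from math import log2, log10

bits_needed = 2**519 * log2(2 + 3**0.5)   # approximately 2**520
print(int(log10(bits_needed)) + 1)        # 157 decimal digits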

The largest Mersenne prime that the SWAC could verify using the naive Lucas-Lehmer test would have been M_13 = 8191, which was found to be prime in the Middle Ages.

[1] Édouard Lucas conjectured in 1878 the theorem that Lehmer proved in 1930.

Multiplying a matrix by its transpose

An earlier post claimed that there are practical advantages to partitioning a matrix, thinking of the matrix as a matrix of matrices. This post will give an example.

Let M be a square matrix and suppose we need to multiply M by its transpose M^T. We can compute this product faster than multiplying two arbitrary matrices of the same size by exploiting the fact that MM^T will be a symmetric matrix.

We start by partitioning M into four blocks

M = \begin{bmatrix}A & B \\ C & D \end{bmatrix}

Then

M^\intercal = \begin{bmatrix}A^\intercal & C^\intercal \\ B^\intercal & D^\intercal \end{bmatrix}

and

MM^\intercal = \begin{bmatrix} AA^\intercal  + BB^\intercal & AC^\intercal  + BD^\intercal \\CA^\intercal + DB^\intercal & CC^\intercal + DD^\intercal \end{bmatrix}

Now for the first clever part: We don’t have to compute both of the off-diagonal blocks because each is the transpose of the other. So we can reduce our calculation by 25% by not calculating one of the blocks, say the lower left block.

And now for the second clever part: apply the same procedure recursively. The diagonal blocks in MM^T involve a matrix times its transpose. That is, we can partition A and use the same idea to compute AA^T, and do the same for BB^T, CC^T, and DD^T. The off-diagonal blocks require general matrix multiplications.

The net result is that we can compute MM^T in about 2/3 the time it would take to multiply two arbitrary matrices of the same size.
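Here is a minimal NumPy sketch of the recursive idea. It assumes a square matrix whose dimension is a power of 2, falls back to ordinary multiplication below an arbitrary block size, and makes no attempt at the cache-level tuning a serious implementation would need.

import numpy as np

def gram(M):
    # Compute M @ M.T by recursive block partitioning.
    n = M.shape[0]
    if n <= 64:                          # base case: ordinary multiplication
        return M @ M.T
    h = n // 2
    A, B = M[:h, :h], M[:h, h:]
    C, D = M[h:, :h], M[h:, h:]
    upper_left  = gram(A) + gram(B)      # A A^T + B B^T, recursively
    lower_right = gram(C) + gram(D)      # C C^T + D D^T, recursively
    upper_right = A @ C.T + B @ D.T      # general block products
    return np.block([[upper_left, upper_right],
                     [upper_right.T, lower_right]])  # lower left is the transpose

M = np.random.rand(256, 256)
print(np.allclose(gram(M), M @ M.T))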

Recently a group of researchers found a way to take this idea even further, partitioning a matrix into a 4 by 4 matrix of 16 blocks and doing some clever tricks. The RXTX algorithm can compute MM^T in about 26/41 the time required to multiply arbitrary matrices, a savings of about 5% over the approach above. A 5% improvement may be significant if it appears in the inner loop of a heavy computation. According to the authors, “The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.”

A bit-twiddling marvel

Pop quiz: what does the following code do?

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

It determines whether the year y is a leap year in the Gregorian calendar, of course. :)

It’s very efficient, though I don’t imagine that would ever matter. But it’s very clever.

Gregorian vs Julian calendar

A year is a leap year in the Julian calendar if and only if it is a multiple of 4. But the Julian year is a bit too long on average to match the earth’s orbit around the sun. You get a much better fit if you add the complication that years divisible by 100 but not by 400 are not leap years.

Presumably no one reading this recalls 1900 not being a leap year, but some of you may need to know some day that 2100 is not a leap year either.

C code

The cryptic function above comes from the recent post A leap year check in three instructions by Falk Hüffner. The function is correct for the next 100 thousand years (more precisely, through 103499) and correct if we anachronistically extend the Gregorian calendar back to the year 0.

The following C code shows empirically that Hüffner’s code is correct, but you’ll have to read his post to understand why it is correct.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

bool is_leap_year_fast(uint32_t y) {
    return ((y * 1073750999) & 3221352463) <= 126976;
}

bool is_leap_year(uint32_t y) {
    if ((y % 4) != 0) return false;
    if ((y % 100) != 0) return true;
    if ((y % 400) == 0) return true;
    return false;
}

int main() {
    for (uint32_t y = 0; y <= 102499; y++)
        if (is_leap_year_fast(y) != is_leap_year(y))
            printf("Exception: %u\n", y);
    return 0;
}

Decimal Separator and Internationalization

This morning I ran across the following tip from Joost Helberg on Mastodon:

TIL don’t report numbers with three digits after the decimal point. People may interpret the decimal point as a thousands separator. Using 2 or 4 digits, although wrong, avoids off by a factor thousand errors.

I usually report four decimal places, but I hadn’t thought about that in relation to the decimal separator problem.

In software development, it’s best to let a library handle numeric input and output, using local conventions. Failure to do so can lead to problems, as I’ll never forget.
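For example, Python’s standard locale module parses and formats numbers according to the current locale. (The sketch below assumes a French locale such as fr_FR.UTF-8 is installed on the system.)

    import locale

    locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
    print(locale.atof("3,14"))                  # 3.14, comma read as the decimal separator
    print(locale.format_string("%.2f", 3.14))   # '3,14', formatted per the locale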

Embarrassed in Bordeaux

In 2006, Peter Thall and I gave a week-long course on Bayesian clinical trial design in Bordeaux, France. Part of the course was presenting software that my team at MD Anderson Cancer Center had developed.

A few minutes into my first presentation I realized the software wasn’t working for the course attendees. The problem had to do with the US and France using opposite conventions for decimal separator and thousands separator. I had tested our software on a French version of Windows, but I had entered integers in my testing and decimals in my presentation.

I apologized and asked my French audience to enter decimals in the American style, such as 3.14 rather than 3,14. But that didn’t work either!

We were using a Windows API for parsing input, which correctly handles input and output per local conventions. But we had written custom validation code. We checked that the input fields contained only valid numeric characters, i.e. digits and periods. Oops!

Users were between a rock and a hard place. The input validation would not accept French notation, and the parsing code would not accept American notation.

The solution was for the attendees to set their operating system locale to the US. They were gracious about having to apply the hack and said that it was a common problem. It was a humiliating way to start the course, but the rest of the week went well.

Formulating eight queens as a SAT problem

The Boolean satisfiability problem is to determine whether there is a way to assign values to variables in a set of Boolean formulas to make the formulas hold [1]. If there is a solution, the next task would be to enumerate the solutions.

You can solve the famous eight queens problem, or its generalization to n-queens, by formulating the problem as a Boolean formula then using a SAT solver to find the solutions.

It’s pretty obvious how to start. A chessboard is an 8 by 8 grid, so you have a variable for each square on the board, representing whether or not that square holds a queen. Call the variables b_ij where i and j run from 1 to 8.

The requirement that every row contains exactly one queen can be turned into two subrequirements:

  1. Each row contains at least one queen.
  2. Each row contains at most one queen.

The first requirement is easy. For each row i, we have a clause

b_i1 ∨ b_i2 ∨ b_i3 ∨ … ∨ b_i8

The second requirement is harder. How do you express in terms of our Boolean variables that there is no more than one queen in each row? This is the key difficulty. If we can solve this problem, then we’re essentially done. We just need to do the analogous thing for columns and diagonals. (We don’t require a queen on every diagonal, but we require that there be at most one queen on every diagonal.)

First approach

There are two ways to encode the requirement that every row contain at most one queen. The first is to use implication. If there’s a queen in the first column, then there is not a queen in the remaining columns. If there’s a queen in the second column, then there is no queen in any of the other columns, etc. We have an implication for each column in each row. Let’s just look at the first row and first column.

b_11 ⇒ ¬(b_12 ∨ b_13 ∨ … ∨ b_18)

We can turn an implication of the form a ⇒ b into the clause ¬a ∨ b.

Second approach

The second way to encode the requirement that every row contain at most one queen is to say that for every pair of squares (a, b) in a row, either a has no queen or b has no queen. So for the first row we would have C(8, 2) = 28 clauses, because there are 28 ways to choose a pair from a set of 8 things.

(¬b_11 ∨ ¬b_12) ∧ (¬b_11 ∨ ¬b_13) ∧ … ∧ (¬b_17 ∨ ¬b_18)

An advantage of this approach is that it directly puts the problem into conjunctive normal form (CNF). That is, our formula is a conjunction of terms that contain only disjunctions, an AND of ORs.
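Here is a small sketch that generates the row clauses in DIMACS-style integer form (positive numbers for variables, negative for their negations), which is the input format off-the-shelf SAT solvers accept. The variable numbering is one reasonable choice of my own; column and diagonal clauses would be generated the same way.

    from itertools import combinations

    N = 8

    def var(i, j):
        # Map square (i, j), with 1-based indices, to a SAT variable number.
        return (i - 1)*N + j

    clauses = []
    for i in range(1, N + 1):
        row = [var(i, j) for j in range(1, N + 1)]
        clauses.append(row)              # at least one queen in row i
        for a, b in combinations(row, 2):
            clauses.append([-a, -b])     # at most one queen in row i

    print(len(clauses))  # 8 + 8*28 = 232 row clauses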

[1] You’ll see the SAT problem described as finding a solution to a single Boolean formula. If you have multiple formulas, you require that the first holds, and the second, and so on, so you can AND them all together to make one formula.