ChaCha RNG with fewer rounds

ChaCha is a CSPRNG, a cryptographically secure pseudorandom number generator. When used in cryptography, ChaCha typically carries out 20 rounds of its internal scrambling process. Google’s Adiantum encryption system uses ChaCha with 12 rounds.

The runtime for ChaCha is proportional to the number of rounds, so you don’t want to do more rounds than necessary if you’re concerned with speed. Adiantum was developed for mobile devices and so Google wanted to reduce the number of rounds while maintaining a margin of cryptographic safety.

Cryptographer Jean-Philippe Aumasson suggests 8 rounds of ChaCha is plenty. He reports that there is a known attack on ChaCha with 6 rounds that requires on the order of 2¹¹⁶ operations, but that ChaCha with 5 rounds is definitely breakable, requiring on the order of only 2¹⁶ operations.

Three rounds of ChaCha are not enough [1], but four rounds are enough to satisfy DIEHARDER, PractRand, and TestU01 [2]. This illustrates the gap between adequate statistical quality and adequate cryptographic quality: Four rounds of ChaCha are apparently enough to produce the former but five rounds are not enough to produce the latter.

There doesn’t seem to be any reason to use ChaCha with four rounds. If you don’t need cryptographic security, then there are faster RNGs you could use. If you do need security, four rounds are not enough.

ChaCha with six rounds seems like a good compromise if you want an RNG that is fast enough for general use and that also has reasonably good cryptographic quality. If you want more safety margin for cryptographic quality, you might want to go up to eight rounds.
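If you want to experiment with reduced-round ChaCha yourself, here is a minimal pure-Python sketch of the ChaCha block function with the number of rounds as a parameter. It follows the IETF layout (32-bit counter, 96-bit nonce) and is meant for experimenting with round counts, not as a production implementation.

    import struct

    def rotl32(x, n):
        return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

    def quarter_round(s, a, b, c, d):
        s[a] = (s[a] + s[b]) & 0xFFFFFFFF; s[d] = rotl32(s[d] ^ s[a], 16)
        s[c] = (s[c] + s[d]) & 0xFFFFFFFF; s[b] = rotl32(s[b] ^ s[c], 12)
        s[a] = (s[a] + s[b]) & 0xFFFFFFFF; s[d] = rotl32(s[d] ^ s[a], 8)
        s[c] = (s[c] + s[d]) & 0xFFFFFFFF; s[b] = rotl32(s[b] ^ s[c], 7)

    def chacha_block(key, counter, nonce, rounds=20):
        """One 64-byte ChaCha output block with a configurable number of rounds."""
        state = [0x61707865, 0x3320646E, 0x79622D32, 0x6B206574,  # "expand 32-byte k"
                 *struct.unpack("<8L", key),                       # 256-bit key
                 counter, *struct.unpack("<3L", nonce)]            # 32-bit counter, 96-bit nonce
        w = state[:]
        for r in range(rounds):
            if r % 2 == 0:                   # even rounds operate on columns
                quarter_round(w, 0, 4, 8, 12); quarter_round(w, 1, 5, 9, 13)
                quarter_round(w, 2, 6, 10, 14); quarter_round(w, 3, 7, 11, 15)
            else:                            # odd rounds operate on diagonals
                quarter_round(w, 0, 5, 10, 15); quarter_round(w, 1, 6, 11, 12)
                quarter_round(w, 2, 7, 8, 13); quarter_round(w, 3, 4, 9, 14)
        return struct.pack("<16L", *[(x + y) & 0xFFFFFFFF for x, y in zip(w, state)])

    # 64 bytes of ChaCha(8) output for a toy key and nonce
    print(chacha_block(bytes(range(32)), 0, bytes(12), rounds=8).hex())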

What a difference one round makes

One thing I find interesting about random number generation and block encryption is that a single round of obfuscation can make a huge difference. ChaCha(3) fails statistical tests but ChaCha(4) is fine. ChaCha(5) is easily broken but ChaCha(6) is not.

Interesting failures

Often random number generators are either good or bad; they pass the standard test suites or they clearly fail. ChaCha(3) is interesting in that it is somewhere in between. As the results in the footnotes show, DIEHARDER hints at problems, but small crush thinks everything is fine. The full crush battery, however, does find problems.

The decisive failure of the Fourier tests is understandable: low-quality generators often fail spectral tests. But the results of the simple poker test are harder to understand. What is it about simulating poker that makes ChaCha(3) fail spectacularly? And in both cases, one more round of ChaCha fixes the problems.

Related posts

[1] ChaCha(3) passes DIEHARDER, though five of the tests passed with a “weak” pass. It passes TestU01 small crush but fails full crush:

========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        32-bit stdin
 Number of statistics:  144
 Total CPU time:   00:39:05.05
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
 25  SimpPoker, d = 64                eps
 26  SimpPoker, d = 64                eps
 27  CouponCollector, d = 4          7.7e-4
 28  CouponCollector, d = 4          7.4e-4
 29  CouponCollector, d = 16          eps
 33  Gap, r = 0                      2.7e-8
 51  WeightDistrib, r = 0             eps
 52  WeightDistrib, r = 8           5.2e-15
 53  WeightDistrib, r = 16           2.1e-5
 55  SumCollector                     eps
 69  RandomWalk1 H (L = 10000)       1.1e-4
 74  Fourier3, r = 0               1.3e-144
 75  Fourier3, r = 20               6.9e-44
 80  HammingWeight2, r = 0           2.8e-6
 83  HammingCorr, L = 300           3.2e-10
 84  HammingCorr, L = 1200          6.1e-13
 ----------------------------------------------
 All other tests were passed

ChaCha(3) also fails PractRand decisively.

RNG_test using PractRand version 0.94
RNG = chacha(3), seed = 0x7221236f
test set = core, folding = standard (32 bit)

rng=chacha(3), seed=0x7221236f
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]BCFN(2+0,13-6,T)         R= +31.0  p =  4.7e-11   VERY SUSPICIOUS
  [Low1/32]BCFN(2+1,13-6,T)         R= +27.4  p =  8.3e-10   VERY SUSPICIOUS
  [Low1/32]BCFN(2+2,13-6,T)         R= +52.7  p =  1.8e-18    FAIL !
  [Low1/32]BCFN(2+3,13-6,T)         R= +47.6  p =  9.5e-17    FAIL !
  [Low1/32]BCFN(2+4,13-7,T)         R= +26.1  p =  1.7e-8   very suspicious
  [Low1/32]DC6-9x1Bytes-1           R= +26.3  p =  1.3e-14    FAIL !
  [Low1/32]FPF-14+6/16:all          R=  +8.6  p =  2.2e-7   very suspicious
  ...and 147 test result(s) without anomalies

[2] ChaCha(4) passed TestU01 small crush and full crush. It passed PractRand using up to 512 GB.

Number of forms of a finite field

The number of elements in a finite field must be a prime power, and for every prime power there is only one finite field up to isomorphism.

The finite field with 256 elements, GF(2⁸), is important in applications. From one perspective, there is only one such field. But there are a lot of different isomorphic representations of that field, and some are more efficient to work with than others.

Just how many ways are there to represent GF(2⁸)? Someone with the handle joriki gave a clear answer on Stack Exchange:

There are 2⁸ − 1 different non-zero vectors that you can map the first basis vector to. Then there are 2⁸ − 2 different vectors that are linearly independent of that vector that you can map the second basis vector to, and so on. In step k, 2ᵏ⁻¹ vectors are linear combinations of the basis vectors you already have, so 2⁸ − 2ᵏ⁻¹ aren’t, so the total number of different automorphisms is

\prod_{k=1}^8\left(2^8 - 2^{k-1} \right ) = 5348063769211699200

This argument can be extended to count the number of automorphisms of any finite field.
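The product above is easy to check numerically. Here is a one-line verification in Python; it counts ordered bases of an 8-dimensional vector space over GF(2), i.e. the order of GL(8, 2).

    from math import prod

    # Number of ordered bases of an 8-dimensional vector space over GF(2)
    print(prod(2**8 - 2**(k - 1) for k in range(1, 9)))   # 5348063769211699200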

More finite field posts

Top cryptography posts of 2019

Toward the end of each year I write a post or two listing the most popular posts by category. This year the categories will be a little different. I’ll start by listing my most popular posts about cryptography this year.

The next categories will be command line tools, privacy, and math.

(When I wrote this, I started with crypto because I didn’t think I’d write any more posts on the topic. Then the announcement about RSA-240 came out and so I wrote something about it yesterday.)

New RSA factoring challenge solved

How hard is it to factor large numbers? And how secure are encryption methods based on the difficulty of factoring large numbers?

The RSA factoring challenges were set up to address these questions. Last year RSA-230 was factored, and this week RSA-240 was factored. This is a 240-digit (795-bit) number, the product of two primes.

Researchers solved two related problems at the same time, factoring RSA-240 and solving a discrete logarithm problem. Together these problems took about 4,000 core-years to solve. It’s not clear from the announcement how long it would have taken just to factor RSA-240 alone.

If you were to rent the computing power used, I imagine the cost would be somewhere in the six figures.

This makes 2048-bit and 3072-bit RSA keys look very conservative. However, the weakest link in RSA encryption is implementation flaws, not the ability to factor big numbers.

Assume for a moment that breaking RSA encryption requires factoring keys. (This may not be true in theory [*] or in practice.) How long would it take to factor a 2048 or 3072 bit key?

The time required to factor a number n using the number field sieve is proportional to

\exp\left( \left(\sqrt[3]{\frac{64}{9}} + o(1)\right)(\ln n)^{\frac{1}{3}}(\ln \ln n)^{\frac{2}{3}}\right)

Here o(1) roughly means terms that go away as n gets larger. (More on the notation here.) For simplicity we’ll assume we can ignore these terms.

This suggests that factoring a 2048-bit key is 12 orders of magnitude harder than factoring RSA-240, and that factoring a 3072-bit key is 18 orders of magnitude harder.
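Here is a small Python sketch behind those estimates. It evaluates the base-10 logarithm of the expression above, with the o(1) term dropped, for 795-, 2048-, and 3072-bit moduli.

    from math import log

    def log10_nfs_work(bits):
        """log10 of the NFS work estimate for an n-bit modulus, with the o(1) term dropped."""
        ln_n = bits * log(2)
        return (64 / 9) ** (1 / 3) * ln_n ** (1 / 3) * log(ln_n) ** (2 / 3) / log(10)

    for bits in (795, 2048, 3072):
        print(bits, round(log10_nfs_work(bits), 1))
    # The 2048- and 3072-bit figures exceed the 795-bit figure by roughly 12 and 18.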

However, I don’t think anyone believes that breaking RSA with 2048-bit keys would require a quadrillion core-years. If the NSA believed this, they wouldn’t be recommending that everyone move to 3072-bit keys.

Why such a large discrepancy? Here are a few reasons. As mentioned above, RSA encryption often has exploitable implementation flaws. And even if implemented perfectly, there is no proof that breaking RSA encryption is as hard as factoring. And there could be breakthroughs in factoring algorithms. And large-scale quantum computers may become practical, in which case factoring would become much easier.

***

[*] Factoring is sufficient to break RSA, but there’s no proof that it’s necessary. Michael Rabin’s variation on RSA is provably as hard to break as factoring: decryption would enable you to factor the key. But as far as I know, Rabin’s method isn’t used anywhere. Even if you know your method is as hard as factoring, maybe factoring isn’t as hard as it seems. Lower bounds on computational difficulty are much harder to obtain than upper bounds.

Quantum supremacy and post-quantum crypto

Google announced today that it has demonstrated “quantum supremacy,” i.e. that they have solved a problem on a quantum computer that could not be solved on a classical computer. Google says

Our machine performed the target computation in 200 seconds, and from measurements in our experiment we determined that it would take the world’s fastest supercomputer 10,000 years to produce a similar output.

IBM disputes this claim. They don’t dispute that Google has computed something with a quantum computer that would take a lot of conventional computing power, only that it “would take the world’s fastest supercomputer 10,000 years” to solve. IBM says it would take 2.5 days.

Scott Aaronson gets into technical details of the disagreement between Google and IBM. He explains that the supercomputer in question is Oak Ridge National Labs’ Summit machine. It covers the area of two basketball courts and has 250 petabytes of disk storage. By exploiting its enormous disk capacity, Summit could simulate Google’s quantum calculation on classical hardware in “only” two and a half days. In a nutshell, it seems Google didn’t consider that you could trade off a gargantuan amount of storage for processor power. But as Aaronson points out, if Google’s machine added just a couple more qubits, even Summit couldn’t keep up on this particular problem.

So does this mean that all the world’s encryption systems are going to fail by the end of the week?

Google selected a problem that is ideal for a quantum computer. And why wouldn’t they? This is the natural thing to do. But they’re not on the verge of rendering public key encryption useless.

Google’s Sycamore processor used 54 qubits. According to this paper, it would take 20,000,000 qubits to factor 2048-bit semiprimes such as used in RSA encryption. So while Google has achieved a major milestone in quantum computing, public key encryption isn’t dead yet.

If and when large-scale quantum computing does become practical, encryption algorithms that depend on the difficulty of factoring integers will be broken. So will algorithms that depend on discrete logarithms, either working over integers or over elliptic curves.

Post-quantum encryption methods, methods that will remain secure even in a world of quantum computing (as far as we know), have been developed but not widely deployed. There’s a push to develop post-quantum methods now so that they’ll be ready by the time they’re needed. Once a new method has been proposed, it takes a long time for researchers to have confidence in it. It also takes a long time to develop efficient implementations that don’t introduce vulnerabilities.

The NSA recommends using existing methods with longer keys for now, then moving to quantum-resistant methods, i.e. not putting any more effort into developing new algorithms that are not quantum-resistant.

Here are some posts I’ve written about post-quantum encryption methods.

Proving that a choice was made in good faith

How can you prove that a choice was made in good faith? For example, if your company selects a cohort of people for random drug testing, how can you convince those who were chosen that they weren’t chosen deliberately? Would a judge find your explanation persuasive? This is something I’ve helped companies with.

It may be impossible to prove that a choice was not deliberate, but you can show good faith by providing evidence that the choice was deliberate, according to a criterion different from the one in question.

I’ll give four examples, three positive and one negative.

Cliff RNG

My previous three blog posts looked at different aspects of the Cliff random number generator. The generator needs a seed between 0 and 1 to start. Suppose I chose 0.8121086949937715 as my seed. On the one hand, that’s a number with no apparent special features. But you might ask “Hey, why that number?” and you’d be right. I show in the first post in the series how that number was chosen to make the generator start off producing duplicate output.

In the next two posts in the series, I chose π – 3 as my seed. That’s a recognizable number and obviously a deliberate choice. But it has no apparent connection to the random number generator, and so it’s reasonable to assume that the seed wasn’t chosen to make the generator look good or bad.

SHA-2

The SHA-2 cryptographic hash function uses seventy-two 32-bit numbers for its initial state, numbers that needed to be “random” in some sense. But if the values were actually chosen at random, critics would suspect that the values were chosen to provide a back door. And maybe there is a clever way to pick the initial state that provides a non-obvious exploitable weakness.

The designers of SHA-2 chose the square roots of the first eight primes to fill one set of constants, and the cube roots of the first sixty-four primes to fill the other. See code here.
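Here is a quick Python sketch of how one set of those constants can be reproduced, assuming SHA-256, whose eight initial hash values are the first 32 bits of the fractional parts of the square roots of the first eight primes.

    from math import sqrt

    primes = [2, 3, 5, 7, 11, 13, 17, 19]
    # First 32 bits of the fractional part of each square root
    h = [int((sqrt(p) % 1) * 2**32) for p in primes]
    print([f"{x:08x}" for x in h])   # 6a09e667, bb67ae85, ... as in the SHA-256 spec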

The initial state is definitely not random. Someone looking at the state would eventually discover where it came from. So the choice was obviously deliberate, but apparently not designed by any cryptographic criterion.

Curve 25519

Daniel Bernstein’s elliptic curve Curve25519 is widely trusted in part because Bernstein made his design choices transparent. The curve is

y² = x³ + 486662x² + x

over the finite field with 2²⁵⁵ − 19 elements, hence the name.

2²⁵⁵ − 19 is the largest prime less than 2²⁵⁵, and being close to 2²⁵⁵ makes the method efficient to implement. The coefficient 486662 is less obvious. But Bernstein explains in his paper that he took the three smallest possible values of this parameter that met the explicit design criteria, and then rejected two of them on objective grounds described at the bottom of the paper.
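The claim about the prime is easy to check, for example with SymPy (a third-party package):

    from sympy import isprime

    p = 2**255 - 19
    print(isprime(p))                                       # True
    print(any(isprime(2**255 - k) for k in range(1, 19)))   # False: no prime closer to 2**255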

NIST P-384

The design of elliptic curve NIST P-384 is not as transparent as that of Curve25519, which has led to speculation that NIST may have designed the curve to have a back door.

The curve has Weierstrass form

y² = x³ – 3x + b

over the finite field with p elements where

p = 2³⁸⁴ − 2¹²⁸ − 2⁹⁶ + 2³² − 1.

As with Curve25519, the choice of field size was motivated by efficiency; the pattern of powers of 2 enables some tricks for efficient implementation. Also, there are objective reasons why the linear coefficient is -3. But the last coefficient b is the 383-bit number

27580193559959705877849011840389048093056905856361568521428707301988689241309860865136260764883745107765439761230575

which has raised some eyebrows. NIST says the value was chosen at random, though not directly. More specifically, NIST says it first generated a certain 160-bit random number, then applied a common key stretching algorithm to obtain b above. Researchers are divided over whether they believe this. See more in this post.

Conclusion

Sometimes you can’t prove that a choice wasn’t deliberate. In that case the best you can do is show that the choice was deliberate, but by an innocent criterion, one unrelated to the matter at hand.

I tried to do this in the Cliff RNG blog posts by using π as my seed. I couldn’t actually use π because the seed had to be between 0 and 1, but there’s an obvious way to take π and produce a number between 0 and 1.

The designers of SHA-2 took a similar route. Just as π is a natural choice for a real number, primes are natural choices for integers. They couldn’t use integers directly, but they made complicated patterns from simple integers in a natural way by taking roots.

Daniel Bernstein gained the cryptography community’s trust by making his design criteria transparent. Given his criteria, the design was almost inevitable.

NIST was not as persuasive as Daniel Bernstein. They claim to have chosen a 160-bit number at random, and they may very well have. But there’s no way to know whether they generated a lot of 160-bit seeds until they found one that resulted in a constant term that has some special property. They may have chosen their seed in good faith, but they have not been completely persuasive.

Sometimes it’s not enough to act in good faith; you have to make a persuasive case that you acted in good faith.

Encryption as secure as factoring

RSA encryption is based on the assumption that factoring large integers is hard. However, it’s possible that breaking RSA is easier than factoring. That is, the ability to factor large integers is sufficient for breaking RSA, but it might not be necessary.

Two years after the publication of RSA, Michael Rabin created an alternative that is provably as hard to break as factoring integers.

Like RSA, Rabin’s method begins by selecting two large primes, p and q, and keeping them secret. Their product n = pq is the public key. Given a message m in the form of a number between 0 and n, the cipher text is simply

c = m² mod n.

Rabin showed that if you can recover m from c, then you can factor n.

However, Rabin’s method does not decrypt uniquely. For every cipher text c, there are four possible clear texts. This may seem like a show stopper, but there are ways to get around it. You could pad messages with some sort of hash before encryption so that it is extremely unlikely that more than one of the four decryption possibilities will have the right format. It is also possible to make the method uniquely invertible by placing some restrictions on the primes used.
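Here is a toy Python sketch of Rabin’s scheme, choosing p ≡ q ≡ 3 (mod 4) so that modular square roots are easy to compute, and using SymPy to generate primes. The key size is illustrative, not a secure choice.

    import random
    from sympy import randprime

    def keygen(bits=256):
        while True:
            p = randprime(2**(bits - 1), 2**bits)
            q = randprime(2**(bits - 1), 2**bits)
            if p % 4 == 3 and q % 4 == 3 and p != q:
                return p * q, (p, q)          # public n, private (p, q)

    def encrypt(m, n):
        return pow(m, 2, n)

    def decrypt(c, p, q):
        # Square roots of c mod p and mod q (easy because p, q ≡ 3 mod 4),
        # then the four combinations via the Chinese Remainder Theorem.
        mp, mq = pow(c, (p + 1) // 4, p), pow(c, (q + 1) // 4, q)
        n = p * q
        yp, yq = pow(p, -1, q), pow(q, -1, p)   # modular inverses (Python 3.8+)
        r = (mp * q * yq + mq * p * yp) % n
        s = (mp * q * yq - mq * p * yp) % n
        return {r, n - r, s, n - s}

    n, (p, q) = keygen()
    m = random.randrange(2, n)
    assert m in decrypt(encrypt(m, n), p, q)    # the message is one of the four roots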

I don’t know that Rabin’s method has been used in practice. I suspect that the assurance that attacks are as hard as factoring integers isn’t valuable enough to inspire people to put in the effort to harden the method for practical use [1].

There’s growing suspicion that factoring may not be as hard as we’ve thought. And we know that if large scale quantum computing becomes practical, factoring integers will be easy.

My impression is that researchers are more concerned about a breakthrough in factoring than they are about a way to break RSA without factoring [2, 3].

Related encryption posts

[1] One necessary step in practical implementation of Rabin’s method would be to make sure m > √n. Otherwise you could recover m by taking the integer square root rather than having to take a modular square root. The former is easy and the latter is hard.

[2] There are attacks against RSA that do not involve factoring, but they mostly exploit implementation flaws. They are not a frontal assault on the math problem posed by RSA.

[3] By “breakthrough” I meant a huge improvement in efficiency. But another kind of breakthrough is conceivable. Someone could prove that factoring really is hard (on classical computers) by establishing a lower bound. That seems very unlikely, but it would be interesting. Maybe it would spark renewed interest in Rabin’s method.

Beating the odds on the Diffie-Hellman decision problem

There are a couple variations on the Diffie-Hellman problem in cryptography: the computation problem (CDH) and the decision problem (DDH). This post will explain both and give an example of where the former is hard and the latter easy.

The Diffie-Hellman problems

The Diffie-Hellman problems are formulated for an Abelian group. The main group we have in mind is the multiplicative group of non-zero integers modulo a large prime p. But we start out more generally because the Diffie-Hellman problems are harder over some groups than others as we will see below.

Let g be a generator of a group. When we think of the group operation written as multiplication, this means that every element of the group is a power of g.

Computational Diffie-Hellman problem

Let a and b be two integers. Given gᵃ and gᵇ, the CDH problem is to compute gᵃᵇ. Note that the exponent is ab and not a+b. The latter would be easy: just multiply gᵃ and gᵇ.

To compute gᵃᵇ we apparently need to know either a or b, which we are not given. Solving for either exponent is the discrete logarithm problem, which is impractical for some groups.

It’s conceivable that there is a way to solve CDH without solving the discrete logarithm problem, but at this time the most efficient attacks on CDH compute discrete logs.

Diffie-Hellman key exchange

Diffie-Hellman key exchange depends on the assumption that CDH is a hard problem.

Suppose Alice and Bob both agree on a generator g, and select private keys a and b respectively. Alice sends Bob gᵃ and Bob sends Alice gᵇ. This doesn’t reveal their private keys because the discrete logarithm problem is (believed to be) hard. Now both compute gᵃᵇ and use it as their shared key. Alice computes gᵃᵇ by raising gᵇ to the power a, and Bob computes gᵃᵇ by raising gᵃ to the power b.
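Here is a toy Python sketch of the exchange. The modulus and generator are illustrative stand-ins, not secure parameter choices.

    import secrets

    # Toy parameters: a Mersenne prime modulus and g = 3, chosen only to
    # illustrate the mechanics. Real deployments use carefully chosen groups.
    p = 2**127 - 1
    g = 3

    a = secrets.randbelow(p - 2) + 1    # Alice's private exponent
    b = secrets.randbelow(p - 2) + 1    # Bob's private exponent

    A = pow(g, a, p)                    # Alice sends g^a
    B = pow(g, b, p)                    # Bob sends g^b

    shared_alice = pow(B, a, p)         # (g^b)^a
    shared_bob = pow(A, b, p)           # (g^a)^b
    assert shared_alice == shared_bob   # both arrive at g^(ab)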

If someone intercepted the exchange between Alice and Bob, they would possess gᵃ and gᵇ and would want to compute gᵃᵇ. This is the CDH problem.

When working over the integers modulo a large prime p, CDH is hard if p-1 has a large factor, such as when p – 1 = 2q for a prime q. If p-1 has only small factors, i.e. if p-1 is “smooth”, then the discrete logarithm problem is tractable via the Silver-Pohlig-Hellman algorithm [1]. But if p is large and p-1 has a large prime factor, CDH is hard as far as we currently know.

Decision Diffie-Hellman problem

The DDH problem also starts with knowledge of gᵃ and gᵇ. But instead of asking to compute gᵃᵇ it asks whether one can distinguish gᵃᵇ from gᶜ for a randomly chosen c with probability better than 0.5.

Clearly DDH is weaker than CDH because if we can solve CDH we know the answer to the DDH question with certainty. Still, the DDH problem is believed to be hard over some groups, such as certain elliptic curves, but not over other groups, as illustrated below.

DDH for multiplicative group mod p

Suppose we’re working in the multiplicative group of non-zero integers modulo a prime p. Using Legendre symbols, which are practical to compute even for very large numbers, you can determine whether gᵃ is a square mod p, which happens if and only if a is even. So by computing the Legendre symbol of gᵃ mod p, we know the parity of a. Similarly we can find the parity of b, and so we know the parity of ab. If ab is odd but gᶜ is a square mod p, we know that the answer to the DDH problem is no. Similarly if ab is even but gᶜ is not a square mod p, we also know that the answer to the DDH problem is no.

Since half the non-zero integers mod p are squares, we can answer the DDH problem with certainty half the time by the parity argument above. If the parity argument is inconclusive, we have to guess. So overall we can answer the DDH problem correctly with probability 0.75.
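A quick simulation makes the 0.75 figure concrete. This sketch uses Euler’s criterion in place of the Legendre symbol, a small toy prime, and SymPy’s primitive_root to find a generator; the strategy is to answer yes exactly when the residuosity of the third element matches what gᵃᵇ would have.

    import random
    from sympy import primitive_root

    def is_square(x, p):
        return pow(x, (p - 1) // 2, p) == 1      # Euler's criterion (Legendre symbol)

    def ddh_guess(ga, gb, gc, p):
        # ab is even iff a or b is even, i.e. iff g^a or g^b is a square.
        # Guess "yes, gc = g^(ab)" exactly when gc has the residuosity g^(ab) would have.
        ab_square = is_square(ga, p) or is_square(gb, p)
        return is_square(gc, p) == ab_square

    p = 10007                                    # a toy prime
    g = primitive_root(p)

    trials, correct = 20_000, 0
    for _ in range(trials):
        a = random.randrange(1, p - 1)
        b = random.randrange(1, p - 1)
        if random.random() < 0.5:                # genuine Diffie-Hellman triple
            gc, is_dh = pow(g, a * b, p), True
        else:                                    # third element chosen at random
            gc, is_dh = pow(g, random.randrange(1, p - 1), p), False
        correct += ddh_guess(pow(g, a, p), pow(g, b, p), gc, p) == is_dh
    print(correct / trials)                      # roughly 0.75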

Related number theory posts

[1] As is common when you have a lot of names attached to a theorem, there were multiple discoveries. Silver discovered the algorithm first, but Pohlig and Hellman discovered it independently and published first.

Twisted elliptic curves

This morning I was sitting at a little bakery thinking about what to do before I check out of my hotel. I saw that the name of the bakery was Twist Bakery & Cafe, and that made me think of writing about twisted elliptic curves when I got back to my room.

Twist of an elliptic curve

Suppose p is an odd prime and let E be the elliptic curve with equation

y² = x³ + ax² + bx + c

where all the variables come from the integers mod p. Now pick some d that is not a square mod p, i.e. there is no x such that x² – d is divisible by p. Then the curve E‘ with equation

dy² = x³ + ax² + bx + c

is a twist of E. More explicitly, it’s a quadratic twist of E. This is usually what people have in mind when they just say twist with no modifier, but there are other kinds of twists.

Let’s get concrete. We’ll work with integers mod 7. Then the squares are 1, 4, and 2. (If that last one looks strange, note that 3*3 = 2 mod 7.) And so the non-squares are 3, 5, and 6. Suppose our curve E is

y² = x³ + 3x + 2.

Then we can form a twist of E by multiplying the left side by 3, 5, or 6. Let’s use 3. So E‘ given by

3y² = x³ + 3x + 2

is a twist of E.

Twisted curve

There’s something potentially misleading in the term “twisted curve”. The curve E‘ is a twist of E, so E‘ is twisted relative to E, and vice versa. You can’t say E‘ is twisted and E is not.

But you might object that E‘ has the form

dy² = x³ + ax² + bx + c

where d is a non-square, and E does not, so E‘ is the one that’s twisted. But a change of variables can leave the curve the same while changing the form of the equation. Every curve has an equation in which the coefficient of y² is 1 and a form in which the coefficient is a non-square. Specifically,

dy² = x³ + ax² + bx + c

and

y² = x³ + dax² + d²bx + d³c

specify the same curve.

The two curves E and E‘ are not isomorphic over the field of integers mod p, but they are isomorphic over the quadratic extension of the integers mod p. That is, if you extend the field by adding the square root of d, the two curves are isomorphic over that field.

Partitioning x coordinates

For a given x, f(x) = x³ + ax² + bx + c is either a square or a non-square. If it is a square, then

y² = f(x)

has a solution and

dy² = f(x)

does not. Conversely, if f(x) is not a square, then

dy² = f(x)

has a solution and

y² = f(x)

does not. So a given x coordinate either corresponds to a point on E or a point on E‘, but not both.
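Here is a quick Python check of this partition for the example curves above (mod 7).

    # Checking the partition claim for E: y^2 = x^3 + 3x + 2 and its twist
    # E': 3y^2 = x^3 + 3x + 2 over the integers mod 7.
    p, d = 7, 3                                  # d = 3 is a non-square mod 7

    def f(x):
        return (x**3 + 3 * x + 2) % p

    def on_E(x):
        return any((y * y) % p == f(x) for y in range(p))

    def on_twist(x):
        return any((d * y * y) % p == f(x) for y in range(p))

    for x in range(p):
        print(x, on_E(x), on_twist(x))           # exactly one True per row (f(x) is never 0 here)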

Application to cryptography

In elliptic curve cryptography, it’s good if not only the curve you’re using is secure but its twists are as well. Suppose you intend to work over a curve E, and someone sends you an x coordinate that’s not on E. If you don’t check whether x corresponds to a point on E, you are effectively working on a twist E‘ rather than E. That can be a security hole if E‘ is not as secure a curve as E, i.e. if the discrete logarithm problem on E‘ can be solved more efficiently than the same problem on E.

More elliptic curve posts

Homomorphic encryption

A function that satisfies

f(x*y) = f(x)*f(y)

is called a homomorphism. The symbol “*” can stand for any operation, and it need not stand for the same thing on both sides of the equation. Technically * is the group operation, and if the function f maps elements of one group to another, the group operation may be different in the two groups.

Examples

Below are three examples of homomorphisms that people usually see before learning the term “homomorphism.”

Exponential

For example, the function f(x) = eˣ is a homomorphism where * stands for addition of real numbers on the left and multiplication of positive real numbers on the right. That is,

eˣ⁺ʸ = eˣ eʸ.

In words, the exponential of a sum is the product of exponentials.

Determinants

Another example of a homomorphism is determinant. Let A and B be two n by n matrices. Then

det(AB) = det(A) det(B).

In words, the determinant of a product is the product of determinants. But “product” refers to two different things in the previous sentence. The first product is a matrix product and the second product is a product of real numbers.

Fourier transform

One last example of a homomorphism is the Fourier transform F. It satisfies

F(f * g) = F(f) F(g)

where the group operation * on the left is convolution. In words, the Fourier transform of the convolution of two functions is the product of their Fourier transforms.

Homomorphic encryption

Homomorphic encryption is an encryption algorithm that is also a homomorphism. It allows the recipient of encrypted data to compute an encrypted form of the result of some computation without knowing the inputs. I give three examples below.

Simple substitution

Here’s a trivial example. Suppose you have a simple substitution cipher, i.e. every letter is simply replaced by some other letter. This is homomorphic encryption where the group operation is string concatenation [1]. If you encrypt two unknown words as xxw and jkh, I know that the encrypted form of the two words stuck together is xxwjkh.

A block cipher like AES is also homomorphic with respect to concatenation, if implemented naively. That is,

E(A + B) = E(A) + E(B)

where + means concatenation. The encrypted form of the concatenation of two blocks is the concatenation of the two blocks encrypted separately.

The approach is called Electronic Code Book or ECB mode. It should not be used in practice. Instead, you might use something like Cipher Block Chaining, CBC mode. Here the encrypted form of one block is XOR’d with the next clear block before encrypting it. [2]
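For example, here is a quick check of the concatenation property for AES in ECB mode; it assumes the third-party cryptography package is installed and uses block-aligned messages.

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(16)

    def ecb_encrypt(data):
        enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
        return enc.update(data) + enc.finalize()

    A, B = os.urandom(16), os.urandom(32)        # multiples of the 16-byte block size
    assert ecb_encrypt(A + B) == ecb_encrypt(A) + ecb_encrypt(B)   # homomorphic w.r.t. concatenation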

An important point here is that when we say an encryption method is homomorphic, we have to say what operations it’s homomorphic with respect to. Being homomorphic with respect to concatenation is not useful, and is in fact a weakness.

Goldwasser-Micali

A non-trivial example is the Goldwasser-Micali algorithm. If you’re given the encrypted form of two bits, you can compute an encrypted form of the XOR of the two bits without knowing the value of either bit. Specifically, bits are encoded as large numbers modulo a public key. If c1 and c2 are encrypted forms of bits b1 and b2, then c1 c2 is an encrypted form of

b1 ⊕ b2.

The same bit can be encrypted many different ways, so this isn’t a homomorphism in the simplest mathematical sense. That’s why I said “an encrypted form” rather than “the encrypted form.” This is common in homomorphic encryption.
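Here is a toy Python sketch of Goldwasser-Micali and its XOR homomorphism. Choosing p ≡ q ≡ 3 (mod 4) lets us use −1 as the public non-residue; SymPy generates the primes, and the key size is illustrative only.

    import random
    from sympy import randprime

    def keygen(bits=128):
        while True:
            p = randprime(2**(bits - 1), 2**bits)
            q = randprime(2**(bits - 1), 2**bits)
            if p % 4 == 3 and q % 4 == 3 and p != q:
                n = p * q
                return (n, n - 1), p             # public (n, x) with x = -1 mod n, private p

    def encrypt(bit, pub):
        n, x = pub
        y = random.randrange(2, n)
        return (y * y * pow(x, bit, n)) % n

    def decrypt(c, p):
        return 0 if pow(c, (p - 1) // 2, p) == 1 else 1   # square mod p means the bit was 0

    pub, p = keygen()
    c1, c2 = encrypt(1, pub), encrypt(0, pub)
    assert decrypt((c1 * c2) % pub[0], p) == 1            # product decrypts to 1 XOR 0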

Paillier

The Paillier encryption algorithm is something like RSA, and is homomorphic with respect to addition of messages. That is, if you encrypt two messages, m1 and m2, and multiply the results, you get something that decrypts to m1 + m2. Here the addition and multiplication are modulo some large integer.

As with the G-M algorithm above, encryption is not unique. Decryption is unique, but there’s a random choice involved in the encryption process. The recipient of the encrypted data is able to compute an encrypted form of the sum of the messages.

The Paillier algorithm lets you encrypt two numbers and have someone securely add them for you without learning what the two numbers were. That doesn’t sound very exciting in itself, but it’s a useful building block for more useful protocols. For example, you may want to let someone carry out statistical calculations on private data without letting them see the data itself.
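Here is a toy Python sketch of Paillier’s additive homomorphism, using the common g = n + 1 simplification and SymPy for prime generation; the parameters are illustrative only.

    import random
    from math import lcm                         # Python 3.9+
    from sympy import randprime

    def keygen(bits=128):
        p = randprime(2**(bits - 1), 2**bits)
        q = randprime(2**(bits - 1), 2**bits)
        while q == p:
            q = randprime(2**(bits - 1), 2**bits)
        n = p * q
        lam = lcm(p - 1, q - 1)
        mu = pow(lam, -1, n)                     # valid because g = n + 1
        return n, (lam, mu)

    def encrypt(m, n):
        r = random.randrange(1, n)               # the random choice mentioned above
        return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(c, n, priv):
        lam, mu = priv
        L = (pow(c, lam, n * n) - 1) // n
        return (L * mu) % n

    n, priv = keygen()
    c1, c2 = encrypt(20, n), encrypt(22, n)
    assert decrypt((c1 * c2) % (n * n), n, priv) == 42    # product of ciphertexts decrypts to the sum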

Related

[1] Strings with concatenation don’t form a group because there are no inverses. So technically we have a homomorphism of semigroups, not groups.

[2] There’s a famous example of the weakness of ECB mode using the Linux penguin logo. If you encrypt the logo image using ECB mode, you can still recognize the penguin in the encrypted version since repetitive parts of the original image correspond to repetitive parts of the encrypted image. If you use CBC mode, however, the encrypted image becomes white noise.

Encrypted Linux logo