Cosmic rays flipping bits

A cosmic ray striking computer memory at just the right time can flip a bit, turning a 0 into a 1 or vice versa. While I knew that cosmic ray bit flips were a theoretical possibility, I didn’t know until recently that there had been documented instances on the ground [1].

Radiolab did an episode on the case of a cosmic bit flip changing the vote tally in a Belgian election in 2003. The error was caught because one candidate got more votes than was logically possible. A recount showed that the candidate in question got 4096 more votes in the first count than in the second. The difference of exactly 2^12 votes was a clue that there had been a bit flip. All the other counts remained unchanged when they reran the tally.
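A power-of-two discrepancy is the telltale sign: flipping the bit in position 12 of a binary counter changes its value by exactly 2^12 = 4096. A minimal sketch (the vote count here is made up):

```python
true_count = 505                    # hypothetical tally
corrupted = true_count ^ (1 << 12)  # cosmic ray flips bit 12

print(corrupted - true_count)       # prints 4096, i.e. 2**12
```

This works out to exactly +4096 because bit 12 of the true count happened to be 0; had it been 1, the flip would have subtracted 4096 instead.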

It’s interesting that the cosmic ray-induced error was discovered presumably because the software quality was high. All software is subject to cosmic bit flipping, but most of it is so buggy that you couldn’t rule out other sources of error.

Cosmic bit flipping is becoming more common because processors have become smaller and more energy efficient: the less energy it takes for a program to set a bit intentionally, the less energy it takes for radiation to set a bit accidentally.

Related post: Six sigma events

[1] Spacecraft are especially susceptible to bit flipping from cosmic rays because they are out from under the radiation shield we enjoy on Earth’s surface.

9 thoughts on “Cosmic rays flipping bits”

  1. For a processor with 4 GB of RAM, I once calculated a bit-flip every 33 hours: http://shape-of-code.coding-guidelines.com/2011/11/07/compiling-to-reduce-the-impact-of-soft-errors-on-program-output/ . Fabrication process size has been reduced since then, so the bit-flip rate has probably increased, all a consequence of Moore’s law: http://shape-of-code.coding-guidelines.com/2013/12/13/unreliable-cpus-and-memory-the-end-result-of-moores-law/

    In practice many bit-flips occur in unused memory, or have no practical effect, e.g., a bit flip only affects a test like (x < 1000) if it moves the value across the threshold. The studies I read a while ago found an actual impact in 10–20% of (artificially inserted) bit-flips.
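    The back-of-envelope estimate above can be sketched as follows. The FIT figure is an assumed, illustrative value (real soft-error rates vary widely with process, altitude, and shielding), chosen here to land near the 33-hour figure:

```python
# 1 FIT = 1 failure per 10**9 device-hours.
FIT_PER_MBIT = 1000          # assumed soft-error rate, for illustration only
RAM_MBIT = 4 * 1024 * 8      # 4 GB of RAM expressed in megabits

flips_per_hour = FIT_PER_MBIT * RAM_MBIT / 1e9
print(f"{1 / flips_per_hour:.1f} hours between flips")  # ~30.5 hours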

  2. I’ve heard of the cosmic ray problem related to spacecraft, but there have also been “sea level” events that act the same way, but ended up having a more earthly cause. Intel’s 16 kbit DRAMs had an unusually high rate of random bit-flip errors. The chips were put in ceramic packages made using water downstream from an old uranium mine. I haven’t found the original paper (“May and Woods of Intel”), but some googling finds a few references to the problem such as this (sections 1.1 through 1.3):
    http://the-eye.eu/public/Books/Electronic%20Archive/Soft_Errors_in_Modern_Electronic_Systems.pdf

  3. That happened to me once. Back in the 1990s, the company that I worked for had a large custom software system which did inventory and billing and accounting and you name it. One day, one batch of invoices came out “weird”. It looked like one particular branch point was sending the code execution the wrong way, so that one particular type of item failed to be included in the invoices.

    I happened to be in the office late so it was my job to undo that batch of invoices and have them rerun. When they were rerun, the invoices all came out correctly. The problem never happened again (and we invoiced millions of line items per week) and nobody could ever produce a plausible reason why the problem occurred.

    My guess was that a random bit-flip had happened in the hardware instruction which corresponded to the failing branch point, but of course that could never be confirmed.

  4. A ten-year-old paper by Google says: “We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.”

    Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. “DRAM errors in the wild: a large-scale field study.” ACM SIGMETRICS Performance Evaluation Review, 2009, pp. 193–204.

    https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35162.pdf

  5. Charlie Harrison

    Are cosmic ray bit flips going to lead to a ‘rise of the machines’ scenario one day when a bit is flipped and suddenly makes an experimental AI become sentient and not so benevolent?!!? LOL!!!

    Might make for a neat plot device in a B Movie down the road if a Hollywood Producer ever comes across this blog entry…

    Happy Monday everyone!

  6. This is why error checking and correction schemes, such as SECDED (single error correction, double error detection) were invented in the 1970s.
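    A minimal sketch of the SECDED idea, using a Hamming(7,4) code plus an overall parity bit. This is illustrative only, not how hardware ECC is actually laid out:

```python
# SECDED sketch: Hamming(7,4) plus an overall parity bit.
# Corrects any single-bit error; detects (but cannot correct) double errors.

def encode(d):
    """Encode 4 data bits (list of 0/1) into an 8-bit codeword."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # Hamming positions 1..7
    overall = 0
    for b in code:
        overall ^= b                 # even parity over the whole codeword
    return code + [overall]

def decode(code):
    """Return (data_bits, status): 'ok', 'corrected', or 'double-error'."""
    c = code[:7]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single error
    overall = 0
    for b in code:
        overall ^= b                 # nonzero iff an odd number of flips
    if syndrome and not overall:     # two flips: detectable, not fixable
        return None, "double-error"
    if syndrome:                     # single flip in positions 1..7: fix it
        c[syndrome - 1] ^= 1
        return [c[2], c[4], c[5], c[6]], "corrected"
    if overall:                      # the overall parity bit itself flipped
        return [c[2], c[4], c[5], c[6]], "corrected"
    return [c[2], c[4], c[5], c[6]], "ok"
```

    Corrupting any single bit of a codeword before decoding still recovers the original four data bits; corrupting two bits is reported but not silently "fixed" to the wrong value.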

  7. @Charlie Harrison that’s why I always use at least 4 bits for my “isSentient” flags. Should probably start doing the same for “isEvil”.

  8. I was once a mentor to an undergraduate project to build a small swarm of 5 sensor satellites on essentially zero budget (but we did have a free ride to LEO). The lack of budget meant we couldn’t use certified rad-hard chips (which can cost hundreds of dollars to tens of thousands of dollars each).

    So, we had to determine if a candidate commercial chip had high resistance to (low susceptibility to) cosmic ray effects. How would you do it?

    Even in orbit, cosmic ray hits are far from continuous, but can still be a significant hazard to even relatively short missions having high electronics content.

    We can’t generate cosmic rays on Earth (and we still lack convincing evidence of how they’re generated in space). But we can simulate them!

    Instead of using a small particle traveling infinitesimally below the speed of light, we can use much heavier particles (atoms). But to get to cosmic ray energies, even massive ions must have relativistic velocities. And for the experiment to conclude within a reasonable amount of time, you need lots of them, a beam of them.

    It’s not like CERN will let us put chips in the beam when they’re running lead ions (instead of protons) in the LHC. But Brookhaven National Labs has their Tandem Van de Graaff generators that can vaporize, ionize and accelerate your choice of dozens of atomic species to the needed velocities.

    But these heavy ions deposit their energy over extremely short distances. So the chip must be “naked” to the beam, meaning the upper portion of its package must be removed, a process called “de-lidding”, which for plastic packages involves fuming nitric acid (friendly stuff). Metal and ceramic encased chips can be de-lidded using a Dremel.

    OK, we have our de-lidded chips, we have our ion source, and the only thing that remains is to choose the ion flux, the number of ions per second hitting the die area of our chip.

    What flux to choose? How?

    In my case, since we were testing microcontrollers, we wanted to be able to cycle power and reboot between hits. Normally this would take seconds, since you “de-rate” everything for space, which in our case meant the CPU clock was running at half its rated speed. Which, at the time, meant our 6 MHz processor was running at 3 MHz.

    This in turn meant the experiment really was about detecting and recovering from hits: We used a very short duration watchdog timer combined with a very sensitive current detector (ion and cosmic ray hits often create conductive ion channels in silicon that act like a short circuit). We used the microcontroller’s own ADC to do the measurement, with a simple (and inherently rad-hard) analog current sensor and amplifier monitoring the supply current.

    After crunching the numbers, it turned out that to get a statistically meaningful number of hits over each of several atomic species within the limited beam time we had purchased, the processor needed to be ready for the next hit less than 100ms after the current hit.

    The processor had a serial link for telemetry, so we’d be able to record the hits and recoveries (or lack thereof). For a maximum hit rate, we wanted to be very sure we’d get at least one telemetry packet across the link between hits. Which meant our operational cycle time had to be halved to 50 ms, and a compact binary format had to be used for the telemetry data.

    Even all these years later, that software still represents a very proud achievement. Especially since it was completed just 2 hours before the first beam reservation.

    One more wrinkle: The beam is in a hard vacuum. As must be our chip and the board carrying it. A hard failure meant a long down-time to swap boards so the experiment could proceed. So we put two boards in the chamber on a stage that could remotely be flipped 180 degrees.

    The two boards monitored each other, since if the processor turned out to not be as “rad-hard” as we hoped, we’d be running multiple instances connected by voting logic.
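    The voting logic mentioned here is often just a bitwise majority across three redundant copies. A sketch of the idea (not the project’s actual implementation):

```python
def vote(a, b, c):
    """Bitwise majority of three redundant values: each output bit is
    whatever at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)

# A flipped bit in any single copy is outvoted by the other two:
print(vote(0b1010, 0b1010, 0b0110) == 0b1010)  # True
```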

    We got the data, and the processor was indeed more than sufficiently rad-hard for our needs.

    But we had a strong suspicion right from the start that this would be the case! Rad-hard chips have enormous amounts of documentation (which is what you are purchasing). That documentation includes the full identity and characterization of the wafer production line used to create the chip.

    So we did some sleuthing to learn what else was made on that line, and indeed several microprocessors and microcontrollers were! We selected and tested the one best suited to our needs.

    We felt no need to have an alternate selection.

  9. We get core dumps when our application crashes. There’s about 10 million installs across the globe. Runs on consumer grade hardware.

    I’ve seen maybe a couple of dozen of these inexplicable bit flips. In some cases I’ve seen the value differ between memory and the register, only a couple of instructions on from the copy. Sometimes it’s a hardware problem, and we’ll see a bunch from the same machine. But a lot of them are completely unique, and never occur again. They get tagged with “Cosmic Rays” so that production know not to try and raise issues from them.
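    A quick triage check of the kind described here: if a corrupted value differs from the expected one by exactly one bit, a stray flip is a plausible culprit (a sketch; the values are made up):

```python
def is_single_bit_flip(expected, observed):
    diff = expected ^ observed
    return diff != 0 and diff & (diff - 1) == 0  # exactly one bit set

print(is_single_bit_flip(0x1000, 0x1400))  # True: only bit 10 differs
print(is_single_bit_flip(0x1000, 0x1401))  # False: two bits differ
```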
