Formally proving a calculation with Claude and Lean

I ran an experiment today to see whether Claude could generate Lean code to prove a calculation at the bottom of this post, six lines of calculus.

I started with this prompt

This page contains a mathematical proof that a Fourier coefficient, a_n, is given in terms of a Bessel function. The LaTeX source for the SVG image is contained in the alt tag of the image. Generate a formal proof of the result using Lean.

and give it the URL of the post. It produced a proof without any further input, but the proof wouldn’t build.

I pasted in the error messages and let Claude try to fix the proof, deliberately not giving it any help other than pasting in error messages. It took eight iterations, but eventually it worked.

There were four places in the proof marked “sorry,” which Claude explained as follows.

Why the sorrys?

Real.besselJ definition or a Kepler namespace. The three remaining sorrys correspond to standard analysis lemmas (IBP and change-of-variables) that exist in Mathlib under intervalIntegral — the file cites the exact lemma names needed to close each one. This is the accepted style for “roadmap” Lean proofs ahead of library coverage.

The full proof is given below.

/-
  KeplerBessel.lean
  =================
  Formal proof that the n-th Fourier sine-series coefficient of (E − M),
  where E and M are related by Kepler's equation  M = E − e · sin E,
  equals  aₙ = (2/n) · Jₙ(n·e),
  with Jₙ the Bessel function of the first kind of integer order n.

  Mathematical content
  --------------------
  We expand  E(M) − M  in a sine series on [0, π]:

      E(M) − M = Σ_{n=1}^∞  aₙ · sin(n·M)

  The standard Fourier formula gives

      aₙ = (2/π) ∫₀^π (E(M) − M) sin(n·M) dM.

  Integrating by parts (boundary terms vanish because E(0)=0 and E(π)=π):

      aₙ = (2/(nπ)) ∫₀^π (E'(M) − 1) cos(n·M) dM
         = (2/(nπ)) ∫₀^π E'(M) cos(n·M) dM     -- the "−1" term vanishes

  Changing variable M ↦ E via M = E − e·sin E  (so E'(M) dM = dE):

      aₙ = (2/(nπ)) ∫₀^π cos(n·E − n·e·sin E) dE
         = (2/n) · Jₙ(n·e).

  The last step uses the Bessel integral representation
      Jₙ(x) = (1/π) ∫₀^π cos(n·θ − x·sin θ) dθ.
-/

import Mathlib

open Real MeasureTheory intervalIntegral Filter Set

noncomputable section

/-! ---------------------------------------------------------------
    §1  Variables
    --------------------------------------------------------------- -/

variable (e : ℝ) (he : 0 ≤ e) (he1 : e < 1) /-! --------------------------------------------------------------- §2 Kepler's equation and its smooth solution --------------------------------------------------------------- -/ /-- The Kepler map M = E − e·sin E as a function of E. -/ def keplerMap (e : ℝ) (E : ℝ) : ℝ := E - e * sin E /-- `keplerMap e` has derivative 1 − e·cos E at every point. -/ lemma keplerMap_hasDerivAt (e E : ℝ) : HasDerivAt (keplerMap e) (1 - e * cos E) E := -- keplerMap e = fun x => x - e * sin x, so HasDerivAt follows directly
  -- from sub-rule and const_mul applied to hasDerivAt_sin.
  (hasDerivAt_id E).sub ((hasDerivAt_sin E).const_mul e)

/-- The derivative of `keplerMap e` is positive when e < 1. -/
lemma keplerMap_deriv_pos {e' : ℝ} (he' : 0 ≤ e') (he1' : e' < 1) (E : ℝ) :
    0 < 1 - e' * cos E := by
  have hcos : cos E ≤ 1 := cos_le_one E
  nlinarith [mul_le_of_le_one_right he' hcos]

/-- `keplerMap e` is strictly monotone when e < 1.
    Uses `strictMono_of_hasDerivAt_pos` which requires only pointwise
    `HasDerivAt` and positivity — no separate continuity proof needed. -/
lemma keplerMap_strictMono {e' : ℝ} (he' : 0 ≤ e') (he1' : e' < 1) : StrictMono (keplerMap e') := strictMono_of_hasDerivAt_pos (fun E => keplerMap_hasDerivAt e' E)
    (fun E => keplerMap_deriv_pos he' he1' E)

/-!
  We axiomatise the inverse  eccAnom : ℝ → ℝ → ℝ  and its key
  properties, all of which follow from the Inverse Function Theorem
  applied to the smooth, strictly monotone map  keplerMap e.
-/

/-- The eccentric anomaly: the smooth right-inverse of `keplerMap e`. -/
axiom eccAnom (e : ℝ) : ℝ → ℝ

/-- `eccAnom e M` satisfies Kepler's equation. -/
axiom eccAnom_kepler (e M : ℝ) :
    keplerMap e (eccAnom e M) = M

/-- `eccAnom e` is differentiable, derivative = 1/(1 − e·cos(eccAnom e M)). -/
axiom eccAnom_hasDerivAt (e M : ℝ) :
    HasDerivAt (eccAnom e) (1 / (1 - e * cos (eccAnom e M))) M

/-- Boundary value at 0. -/
axiom eccAnom_zero (e : ℝ) : eccAnom e 0 = 0

/-- Boundary value at π. -/
axiom eccAnom_pi (e : ℝ) : eccAnom e π = π

/-! ---------------------------------------------------------------
    §3  Bessel function of the first kind (integer order)

    Defined by the classical integral representation.
    --------------------------------------------------------------- -/

/-- Bessel function J_n(x) via its integral representation. -/
def besselJ (n : ℕ) (x : ℝ) : ℝ :=
  (1 / π) * ∫ θ in (0 : ℝ)..π, cos (↑n * θ - x * sin θ)

/-! ---------------------------------------------------------------
    §4  Fourier coefficient

    Named  keplerFourierCoeff  to avoid clashing with Mathlib's own
    `fourierCoeff` which is defined on  AddCircle.
    --------------------------------------------------------------- -/

/-- The n-th Fourier sine coefficient of  eccAnom e M − M  on [0,π]. -/
def keplerFourierCoeff (e : ℝ) (n : ℕ) : ℝ :=
  (2 / π) * ∫ M in (0 : ℝ)..π,
    (eccAnom e M - M) * sin (↑n * M)

/-! ---------------------------------------------------------------
    §5  Main theorem
    --------------------------------------------------------------- -/

/--
  **Main theorem.** For n ≥ 1, the Fourier sine coefficient of the
  eccentric-anomaly displacement satisfies  aₙ = (2/n) · Jₙ(n·e).
-/
theorem keplerFourierCoeff_eq_besselJ (n : ℕ) (hn : 1 ≤ n) :
    keplerFourierCoeff e n = (2 / (n : ℝ)) * besselJ n (↑n * e) := by
  simp only [keplerFourierCoeff, besselJ]
  -- Goal:
  --   (2/π) · ∫₀^π (E(M)−M)·sin(nM) dM
  --   = (2/n) · (1/π) · ∫₀^π cos(nθ − ne·sinθ) dθ

  -- ── Step 1: Integration by parts ─────────────────────────────────────
  -- u = E(M)−M,  dv = sin(nM)dM  →  v = −cos(nM)/n
  -- Boundary: [uv]₀^π = 0 by eccAnom_zero, eccAnom_pi.
  -- Result: (2/π)∫(E−M)sin(nM)dM = (2/π)(1/n)∫(E'(M)−1)cos(nM)dM
  --
  -- Mathlib lemma: intervalIntegral.integral_mul_deriv
  --   (or integral_deriv_mul_eq_sub_of_hasDerivAt applied to
  --    u = eccAnom e − id,  v = −sin(n··)/n)
  have step1 :
      (2 / π) * ∫ M in (0 : ℝ)..π, (eccAnom e M - M) * sin (↑n * M)
      = (2 / π) * (1 / ↑n) *
          ∫ M in (0 : ℝ)..π, (deriv (eccAnom e) M - 1) * cos (↑n * M) := by
    sorry

  -- ── Step 2: The "−1" integral vanishes ───────────────────────────────
  -- ∫₀^π cos(nM) dM = [sin(nM)/n]₀^π = 0  (integer n ≥ 1)
  -- Mathlib: integral_cos, Real.sin_nat_mul_pi
  have cos_int_zero :
      ∫ M in (0 : ℝ)..π, cos (↑n * M) = 0 := by
    sorry

  have step2 :
      ∫ M in (0 : ℝ)..π, (deriv (eccAnom e) M - 1) * cos (↑n * M)
      = ∫ M in (0 : ℝ)..π, deriv (eccAnom e) M * cos (↑n * M) := by
    have key : ∀ M : ℝ, (deriv (eccAnom e) M - 1) * cos (↑n * M)
                       = deriv (eccAnom e) M * cos (↑n * M) - cos (↑n * M) := by
      intro M; ring
    simp_rw [key]
    rw [intervalIntegral.integral_sub _ _]
    · rw [cos_int_zero, sub_zero]
    · -- IntervalIntegrable (deriv (eccAnom e) · cos(n··))
      sorry
    · exact (continuous_cos.comp (continuous_const.mul continuous_id')).intervalIntegrable 0 π

  -- ── Step 3: Change of variable M ↦ E via Kepler's equation ───────────
  -- Under M = E − e·sin E:  E'(M) dM = dE,  cos(nM) = cos(nE − ne·sinE)
  -- Mathlib: MeasureTheory.integral_image_eq_integral_abs_deriv_smul
  --       or intervalIntegral.integral_comp_deriv
  have step3 :
      ∫ M in (0 : ℝ)..π, deriv (eccAnom e) M * cos (↑n * M)
      = ∫ E in (0 : ℝ)..π, cos (↑n * E - ↑n * e * sin E) := by
    sorry

  -- ── Step 4: Recognise the Bessel integral ────────────────────────────
  -- (1/π)∫₀^π cos(nE − ne·sinE)dE = J_n(ne)  by definition.
  have step4 :
      (1 / π) * ∫ E in (0 : ℝ)..π, cos (↑n * E - ↑n * e * sin E)
      = besselJ n (↑n * e) := by
    simp only [besselJ]

  -- ── Combine all steps ────────────────────────────────────────────────
  have hpi : π > 0 := Real.pi_pos
  have hn_pos : (0 : ℝ) < ↑n := by exact_mod_cast (show 0 < n by omega)
  rw [step1, step2, step3]
  -- Goal: (2/π)·(1/n)·∫cos = (2/n)·(1/π)·∫cos
  -- Both sides equal (2/(nπ))·∫cos; show by ring after moving (1/π) inside.
  have factored : (2 / π) * (1 / ↑n) *
      ∫ E in (0 : ℝ)..π, cos (↑n * E - ↑n * e * sin E)
      = (2 / ↑n) * ((1 / π) *
      ∫ E in (0 : ℝ)..π, cos (↑n * E - ↑n * e * sin E)) := by
    ring
  rw [factored, step4]

end

/-! ---------------------------------------------------------------
    §6  Summary of proof obligations

    Axioms (all standard consequences of the Inverse Function Theorem
    applied to the smooth, strictly increasing map  keplerMap e):
      • eccAnom           — existence of the smooth inverse
      • eccAnom_kepler    — it satisfies Kepler's equation
      • eccAnom_hasDerivAt — C¹ with derivative 1/(1 − e·cos E)
      • eccAnom_zero      — E(0) = 0
      • eccAnom_pi        — E(π) = π

    `sorry`s (each closes with a standard Mathlib lemma):
      • step1        IBP via
                      intervalIntegral.integral_deriv_mul_eq_sub_of_hasDerivAt
      • cos_int_zero  ∫₀^π cos(nM)dM = 0 via
                      integral_cos + Real.sin_nat_mul_pi
      • step2 integrability  IntervalIntegrable for deriv(eccAnom e)·cos(n·)
      • step3        Change of variables via
                      MeasureTheory.integral_image_eq_integral_abs_deriv_smul

    §7  Finding minimal imports

    Once the file builds cleanly, add at the bottom:

        #min_imports

    and the Lean Infoview will report the exact minimal import list
    for the version of Mathlib you have installed.
    --------------------------------------------------------------- -/

How much certainty is worthwhile?

A couple weeks ago I wrote a post on a composition table, analogous to a multiplication table, for trig functions and inverse trig functions.

\renewcommand{\arraystretch}{2.2} \begin{array}{c|c|c|c} & \sin^{-1} & \cos^{-1} & \tan^{-1} \\ \hline \sin & x & \sqrt{1-x^{2}} & \dfrac{x}{\sqrt{1+x^2}} \\ \hline \cos & \sqrt{1-x^{2}} & x & \dfrac{1}{\sqrt{1 + x^2}} \\ \hline \tan & \dfrac{x}{\sqrt{1-x^{2}}} & \dfrac{\sqrt{1-x^{2}}}{x} & x \\ \end{array}

Making mistakes and doing better

My initial version of the table above had some errors that have been corrected. When I wrote a followup post on the hyperbolic counterparts of these functions I was more careful. I wrote a little Python code to verify the identities at a few points.

\renewcommand{\arraystretch}{2.2} \begin{array}{c|c|c|c} & \sinh^{-1} & \cosh^{-1} & \tanh^{-1} \\ \hline \sinh & x & \sqrt{x^{2}-1} & \dfrac{x}{\sqrt{1-x^2}} \\ \hline \cosh & \sqrt{x^{2} + 1} & x & \dfrac{1}{\sqrt{1 - x^2}} \\ \hline \tanh & \dfrac{x}{\sqrt{x^{2}+1}} & \dfrac{\sqrt{x^{2}-1}}{x} & x \\ \end{array}

Checking a few points

Of course checking an identity at a few points is not a proof. On the other hand, if you know the general form of the answer is right, then checking a few points is remarkably powerful. All the expressions above are simple combinations of a handful of functions: squaring, taking square roots, adding or subtracting 1, and taking ratios. What are the chances that a couple such combinations agree at a few points but are not identical? Very small; zero if you formalize the problem correctly. More on that in the next post.

In the case of polynomials, checking a few points may be sufficient. If two polynomials in one variable agree at enough points, they agree everywhere. This can be applied when it’s not immediately obvious that identity involves polynomials, such as proving theorems about binomial coefficients.

The Schwartz-Zippel lemma is a more sophisticated version of this idea that is used in zero knowledge proofs (ZKP). Statements to be proved are formulated as multivariate polynomials over finite fields. The Schwartz-Zippel lemma quantifies the probability that the polynomials could be equal at a few random points but not be equal everywhere. You can prove that a statement is correct with high probability by only checking a small number of points.

Achilles heel

The first post mentioned above included geometric proofs of the identities, but also had typos in the table. This is an important point: formally verified systems can and do contain bugs because there is inevitably some gap between what it formally verified and what is not. I could have formally verified the identities represented in the table, say using Lean, but introduced errors when I manually transcribe the results into LaTeX to make the diagram.

It’s naive to say “Well then don’t leave anything out. Formally verify everything.” It’s not possible to verify “everything.” And things that could in principle be verified may require too much effort to do so.

There are always parts of a system that are not formally verified, and these parts are where you need to look first for errors. If I had formally verified my identities in Lean, it would be more likely that I made a transcription error in typing LaTeX than that the Lean software had a bug that allowed a false statement to slip through.

Economics

The appropriate degree of testing or formal verification depends on the context. In the case of the two blog posts above, I didn’t do enough testing for the first but did do enough for the second: checking identities at a few random points was the right level of effort. Software that controls a pacemaker or a nuclear power plant requires a higher degree of confidence than a blog post.

Rigorously proving identities

Suppose you want to rigorously prove the identities in the tables above. You first have to specify your domains. Are the values of x real numbers or complex numbers? Extending to the complex numbers doesn’t make things harder; it might make them easier by making some problems more explicit.

The circular and hyperbolic functions are easy to define for all complex numbers, but the inverse functions, including the square root function, require more care. It’s more work than you might expect, but you can find an outline of a full development here. Once you have all the functions carefully defined, the identities can be verified by hand or by a CAS such as Mathematica. Or even better, by both.

Related posts

Automation and Validation

It’s been said whatever you can validate, you can automate. An AI that produces correct work 90% of the time could be very valuable, provided you have a way to identify the 10% of the cases where it is wrong. Often verifying a solution takes far less computation than finding a solution. Examples here.

Validating AI output can be tricky since the results are plausible by construction, though not always correct.

Consistency checks

One way to validate output is to apply consistency checks. Such checks are necessary, but not sufficient, and often easy to implement. An simple consistency check might be that inputs to a transaction equal outputs. A more sophisticated consistency check might be conservation of energy or something analogous to it.

Certificates

Some problems have certificates, ways of verifying that a calculation is correct that can be evaluated with far less effort than finding the solution that they verify. I’ve written about certificates in the context of optimization, solving equations, and finding prime numbers.

Formal methods

Correctness is more important in some contexts than others. If a recommendation engine makes a bad recommendation once in a while, the cost is a lower probability of conversion in a few instances. If an aircraft collision avoidance system makes an occasional error, the consequences could be catastrophic.

When the cost of errors is extremely high, formal verification may be worthwhile. Formal correctness proofs using something like Lean or Rocq are extremely tedious and expensive to create, and hence not economical. But if an AI can generate a result and a formal proof of correctness, hurrah!

Who watches the watchmen?

But if an AI result can be wrong, why couldn’t a formal proof generated to defend the result also be wrong? As the Roman poet Juvenal asked, Quis custodiet ipsos custodes? Who will watch the watchmen?

An AI could indeed generate an incorrect proof, but if it does, the proof assistant will reject it. So the answer to who will watch Claude, Gemini, and ChatGPT is Lean, Rocq, and Isabelle.

Who watches the watchers of the watchmen?

Isn’t it possible that a theorem prover like Rocq could have a bug? Of course it’s possible; there is no absolute certainty under the sun. But hundreds of PhD-years of work have gone into Rocq (formerly Coq) and so bugs in the kernel of that system are very unlikely. The rest of the system is bootstrapped, verified by the kernel.

Even so, an error in the theorem prover does not mean an error in the original result. For an incorrect result to slip through, the AI-generated proof would have to be wrong in a way that happens to exploit an unknown error in the theorem prover. It is far more likely that you’re trying to prove the wrong thing than that the theorem prover let you down.

I mentioned collision avoidance software above. I looked into collision avoidance software when I did some work for Amazon’s drone program. The software that was formally verified was also unrealistic in its assumptions. The software was guaranteed to work correctly, if two objects are flying at precisely constant velocity at precisely the same altitude etc. If everything were operating according to geometrically perfect assumptions, there would be no need for collision avoidance software.

When are formal proofs worth the effort?

Formal verification of theorems takes a lot of work. And it takes more work where it is least needed. But the good news is that it takes less effort in contexts where it is needed most.

Years of effort have gone into formally verifying the proofs of theorems that no one doubted were correct. For example, a team worked for six years to formally verify the proof of the odd-order theorem in group theory.

Mathematical proofs are fallible, but it’s rare for a proof to be accepted that reaches a wrong conclusion. Flaws in proofs are much more likely to be gaps in reasoning than to be arguments in favor of false statements. And the more people who study a proof, the more likely it is that flaws will be found and fixed.

The return on investment for formal verification is highest for theorems that are the opposite of the odd-order theorem, theorems that are obscure and shallow rather than famous and deep.

For example, suppose you want to prove that a complex set of security protocols doesn’t have any gaps. Your problem is obscure in that only your company cares about it. You don’t have the assurance that would come from thousands of mathematicians around the world reviewing your work.

But on the plus side, your security rules are logically shallow. These rules may be too complicated to hold in your head, but they can be formally specified with far less effort than some mathematical object like a Möbius strip. In general, discrete things like if-then rules are much easier to formalize than continuous things like real numbers.

Mathematicians choose to study things they have some intuition for, and that intuition is often correct even when the reasoning around it is logically incomplete. But businesses are often faced with complex problems they did not chose and problems for which no one has much intuition, such as proving that a complex circuit does what it’s supposed to do, or that an encryption protocol didn’t neglect some edge case. That’s where the ROI on formal methods is greatest.

Automated theorem proving

When I first heard of automated theorem proving, I imagined computers being programmed to search for mathematical theorems interesting to a wide audience. Maybe that’s what a few of the pioneers in the area had in mind too, but that’s not how things developed.

The biggest uses for automated theorem proving have been highly specialized applications, not mathematically interesting theorems. Computer chip manufacturers use formal methods to verify that given certain inputs their chips produce certain outputs. Compiler writers use formal methods to verify that their software does the right thing. A theorem saying your product behaves correctly is very valuable to you and your customers, but nobody else. These aren’t the kinds of theorems that anyone would cite the way they might site the Pythagorean theorem. Nobody would ever say “And therefore, by the theorem showing that this particular pacemaker will not fall into certain error modes, I now prove this result unrelated to pacemakers.”

Automated theorem provers are important in these highly specialized applications in part because the results are of such limited interest. For every theorem of wide mathematical interest, there are a large number of mathematicians who are searching for a proof or who are willing to scrutinize a proposed proof. A theorem saying that a piece of electronics performs correctly appeals to only the tiniest audience, and yet is probably much easier (for a computer) to prove.

The term “automated theorem proving” is overloaded to mean a couple things. It’s used broadly to include any use of computing in proving theorems, and it’s used more narrowly to mean software that searches for proofs or even new theorems. Most theorem provers in the broad sense are not automated theorem provers in the more narrow sense but rather proof assistants. They verify proofs rather than discover them. (There’s some gray zone. They may search on a small scale, looking for a way to prove a minor narrow result, but not search for the entire proof to a big theorem.) There have been computer-verified proofs of important mathematical theorems, such as the Feit-Thompson theorem from group theory, but I’m not aware of any generally interesting discoveries that have come out of a theorem prover.

Related post: Formal methods let you explore the corners

Formal methods let you explore the corners

I heard someone say the other day that the advantage of formal software validation methods is that they let you explore the corners, cases where intuition doesn’t naturally take you.

This made me think of corners in the geometric sense. If you have a sphere in a box in high dimensions, nearly all the volume is in the corners, i.e. outside the sphere. This is more than a metaphor. You can think of software options geometrically, with each independent choice corresponding to a dimension. Paths through a piece of software that are individually rare may account for nearly all use when considered together.

With a circle inside a square, nearly 78.5% of the area is inside the circle. With a ball sitting inside a 3-D box, 52.4% of the volume is inside the ball. As the dimension increases, the proportion of volume inside the sphere rapidly decreases. For a 10-dimensional sphere sitting in a 10-dimensional box, 0.25% of the volume is in the sphere. Said another way, 99.75% of the volume is in the corners.

When you go up to 100 dimensions, the proportion of volume inside the sphere is about 2 parts in 1070, a 1 followed by 70 zeros [1]. If 100 dimensions sounds like pure fantasy, think about a piece of software with more than 100 features. Those feature combinations multiply like geometric dimensions [2].

Here’s a little Python code you could use to see how much volume is in a sphere as a function of dimension.

    from scipy.special import gamma
    from math import pi

    def unit_sphere_volume(n): 
        return pi**(0.5*n)/gamma(0.5*n + 1)

    def unit_cube_volume(n): 
        return 2**n

    def ratio(n):
        return unit_sphere_volume(n) / unit_cube_volume(n)

    print( [ratio(n) for n in range(1, 20)] )

* * *

[1] There are names for such extremely large numbers. These names are hardly ever used—scientific notation is much more practical— but they’re fun to say. 1070 is ten duovigintillion in American nomenclature, ten undecilliard in European.

[2] Geometric dimensions are perfectly independent, but software feature combinations are not. In terms of logic, some combinations may not be possible. Or in terms of probability, the probability of exploring some paths is conditional on the probability of exploring other paths. Even so, there are inconceivably many paths through any large software system. And in large-scale operations, events that should “never happen” happen regularly.