Suppose you have a computer that can evaluate and compose continuous functions of one real variable and can do addition. What kinds of functions could you compute with it? You could compute functions of one variable by definition, but could you bootstrap it to compute functions of two variables?

Here’s an example that shows this computer might be more useful than it seems at first glance. We said it can do addition, but can it multiply? Indeed it can [1].

We can decompose the function

f(x, y) = xy

into

f(x, y) = g_1(g_2(x) + g_2(y))

where

g_1(x) = exp(x) and g_2(x) = log(x).

So multiplication of two variables can be decomposed into the addition and composition of functions of one variable. What other functions of two variables can be decomposed this way? What about functions of three or more variables?
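A minimal Python sketch of the decomposition, assuming g_1 = exp and g_2 = log (one natural choice, consistent with the comment below that the domain of g_2 restricts the allowed values of x to positive reals):

```python
import math

def g2(x):
    # inner univariate function: log, defined only for x > 0
    return math.log(x)

def g1(t):
    # outer univariate function: exp
    return math.exp(t)

def multiply(x, y):
    # multiplication built only from addition and composition
    # of continuous functions of one variable
    return g1(g2(x) + g2(y))

print(multiply(3.0, 4.0))  # close to 12.0, up to floating point
```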

The Kolmogorov-Arnol’d theorem says that *all* continuous functions, of however many variables you like, can be decomposed this way.

Here is the original version of the Kolmogorov-Arnol’d theorem from 1957:

Let f: [0, 1]^n → R be an arbitrary multivariate continuous function. Then it has the representation

f(x_1, …, x_n) = Σ_{q=0}^{2n} Φ_q( Σ_{p=1}^{n} ψ_{q,p}(x_p) )

where the univariate functions Φ_q and ψ_{q,p} are continuous and defined on the real line. Furthermore, the ψ_{q,p} can be chosen independent of f.
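Once the Φ_q and ψ_{q,p} are given, the representation is mechanical to evaluate. Here is a minimal Python sketch of the superposition form; the demo reuses the exp/log multiplication trick as a degenerate case with a single nonzero outer term, which is an illustration only, not the actual functions from Kolmogorov's construction:

```python
import math

def ka_superposition(Phi, psi, xs):
    """Evaluate sum over q = 0..2n of Phi[q]( sum over p of psi[q][p](x_p) )."""
    n = len(xs)
    assert len(Phi) == 2 * n + 1 and len(psi) == 2 * n + 1
    return sum(
        Phi[q](sum(psi[q][p](xs[p]) for p in range(n)))
        for q in range(2 * n + 1)
    )

# Degenerate demo with n = 2: Phi_0 = exp, psi_{0,p} = log,
# and all remaining outer/inner functions identically zero.
zero = lambda t: 0.0
Phi = [math.exp] + [zero] * 4
psi = [[math.log, math.log]] + [[zero, zero]] * 4
print(ka_superposition(Phi, psi, [3.0, 4.0]))  # close to 12.0
```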

Later refinements showed that it is possible to get by with a single outer function Φ depending on *f*, but independent of *q* [1].

The original theorem was not constructive, but later Jürgen Braun and Michael Griebel gave a constructive proof [2].

## References

- [1] Sidney A. Morris. “Hilbert 13: Are there any genuine continuous multivariate real-valued functions?” Bull. Amer. Math. Soc. (posted online July 6, 2020). https://doi.org/10.1090/bull/1698
- [2] Jürgen Braun, Michael Griebel. On a Constructive Proof of Kolmogorov’s Superposition Theorem. Constructive Approximation 30, 653 (2009). https://doi.org/10.1007/s00365-009-9054-2

Is it possible to explicitly modify the example to extend the domain to all of R^2?

You can certainly use the extended reals, since the extended reals are homeomorphic to the unit interval. You could compose with something like inverse tangent and rescale to map R to [0, 1], but I think you need continuity at ± infinity.
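A sketch of that rescaling idea, with hypothetical `squash`/`unsquash` helpers built from inverse tangent. It maps R onto the open interval (0, 1), which illustrates why the behavior at ± infinity (the missing endpoints) is exactly the sticking point:

```python
import math

def squash(x):
    # atan maps R onto (-pi/2, pi/2); rescale to (0, 1).
    # The endpoints 0 and 1 are never attained for finite x.
    return math.atan(x) / math.pi + 0.5

def unsquash(u):
    # inverse map from (0, 1) back onto R
    return math.tan(math.pi * (u - 0.5))

print(squash(0.0))            # 0.5
print(unsquash(squash(2.7)))  # recovers 2.7 up to floating point
```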

In the example, the domain of g_2(x) restricts the allowed values of x. Could there be a way to lift this restriction?

For the Neural Networks community, Hilbert’s 13th problem (posed at the International Congress of Mathematicians in 1900: whether a certain specific function of three variables can be written as a multiple (yet finite) composition of continuous functions of just two variables – see for example https://www.math.toronto.edu/~drorbn/Talks/Fields-0911/Hilbert13th.pdf ) was the Big Bang moment.

Perhaps in the 1960’s that Big Bang failed to ignite despite the presence of “flammable materials” (at least not without today’s hindsight) but definitely it did “Bang” Bigly [sic :)] in the 1980’s with the application of Backpropagation to multi-layer NN.

The last equation you cite represents more-or-less a 3-layer NN (1 input layer of dim N, 1 “hidden” layer of size 2N, 1 output of dim 1), with the psi’s independent of f. In today’s NN jargon the psi’s are the non-linearities / sigmoidal functions. It was later adjusted for the case of multi-layer NN and was called “Kolmogorov’s Mapping Neural Network Existence Theorem” by Hecht-Nielsen ( https://cs.uwaterloo.ca/~y328yu/classics/Hecht-Nielsen.pdf )

I am not sure who was the first to link Kolmogorov’s theorem to NN, perhaps Hecht-Nielsen in 1989? But almost immediately enough momentum was gathered. See for example Kurkova’s https://www.cs.cas.cz/~vera/publications/journals/kolmogorov_relevant_small.pdf

The story is perhaps relevant to your next post on Math History ( https://www.johndcook.com/blog/2020/07/16/math-history/ ) as an Appendix, perhaps titled “The Parallel and Sealed Worlds of Mathematics”? Lorentz, Sprecher and I guess lots of others were re-visiting Kolmogorov’s 1957 theorem from a few years after it was published, while at the same time 1960’s NN research was “begging” for such an existence theorem which would, theoretically at least, ensure research funding.

Likewise, the NN field was “begging” for a training algorithm for deeper NN, while Backpropagation had been around since the 1960’s as just another application of the Chain Rule.

Given that today’s AI applications are mostly based on NN, I wonder what would have happened if it all started 20 years earlier. I guess nothing, because even when NN were revived in the 1990’s it took another 20–25 years to reach today’s level. Which is fair, since other factors were more important, for example: computing power and price, the vast scale of data sharing via the Internet as a means of sharing training data, demand for “apps”, and of course the current state of Capitalism. The accumulation of Capital, responsible for the financial crises, was this time translated into bigger risks and easy funding for start-ups, meaning accelerated invention of new consumer needs so that it can satisfy them. So in this case NN is neither the chicken nor the egg but rather chicken food, and it does not matter if it was revived in the 1960’s or 1990’s or 2010’s! – in my opinion.

I was also wondering why the theorem was named after “Kolmogorov-Arnol’d” although it was published by Kolmogorov alone. Hecht-Nielsen’s paper cited above states that its discovery came out of a “friendly duel” between the two Soviet mathematicians, who set out to resolve all remaining questions posed by Hilbert 50 years earlier. A duel won by Kolmogorov.

I have also posted some, perhaps overlapping, information about NN and the 1960’s in your other excellent post many years ago:

https://www.johndcook.com/blog/2016/10/14/the-big-deal-about-neural-networks/

bw,

Andreas Hadjiprocopis

In my previous comment I missed an additional connection for the K-A theorem, regarding the neuron-like analogies of NN, i.e. the neurons as the simple building blocks.

It’s difficult or impossible to pinpoint exactly the role of neurons in a monolithic ad-hoc transfer function in R^N.

Whereas in its equivalent *superposition* of *simple* functions in R^1, thanks to Kolmogorov-Arnol’d, this analogy/mapping is quite clear and immediately opens up the *distributivity* and massive scale of the brain, linking the ideas of D. Rumelhart, J. McClelland, Parallel Distributed Processing et al.

bw,

Andreas Hadjiprocopis