Can regular expressions parse HTML or not?

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post [Update: link went away] by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

Free damped vibrations

This is the second post in a series on vibrations determine by the equation

m u'' + γ u' + k u = F cos ωt

The first post in the series looked at the simplest case, γ = 0 and F = 0. Now we’ll look at the more realistic and more interesting case of damping γ > 0. We won’t consider forcing yet, so F = 0 and our equation reduces to

m u'' + γ u' + k u = 0.

The effect of damping obviously depends on the amount of damping, i.e. the size of γ relative to other terms. Damping removes energy from the system and so the amplitude of the oscillations goes to zero over time, regardless of the amount of damping. However, the system can have three qualitatively different behaviors: under-damping, critical damping, and over-damping.

The characteristic polynomial of the equation is

m x2 + γ x + k

This polynomial has either two complex roots, one repeated real root, or two real roots depending on whether the discriminant γ2 – 4mk is negative, zero, or positive. These three states correspond to under-damping, critical damping, and over-damping.

Under-damping

When damping is small, the system vibrates at first approximately as if there were no damping, but the amplitude of the solutions decreases exponentially. As long as γ2 < 4mk the system is under-damped and the solution is

u(t) = R exp( –γt/2m ) cos( μt– φ )

where

μ = √(4mk – γ2) / 2m.

The amplitude R and the phase φ are determined by the initial conditions.

As an example, let γ =1, m = 1, and k = 5. This implies μ =  √19/2. Set the initial conditions so that R = 6 and φ = 1. The solid blue line below is the solution 6 exp(-t/2) cos(t– 1), and the dotted green lines above and below are the exponential envelope 6 exp(-t/2).

And here is a video of the mass-spring-dashpot system in action made using the code discussed here.

Note that the solution is not simply an oscillation at the natural frequency multiplied by an exponential term. The damping changes the frequency of the periodic term, though the change is small when the damping is small.

The factor μ is called the quasi-frequency. Recall from the previous post that the natural frequency, the frequency of the solution with no damping, is denoted ω0. The quasi-frequency is related to the natural frequency by

μ = √(1 – γ2/4mk) ω0 ≈ (1 – γ2/8mk) ω0.

The approximation above holds for small γ since a Taylor expansion shows that √(1 – x) ≈ 1 – x/2 for small x.

This says that the quasi-frequency decreases the natural frequency by a factor of approximately γ2/8mk. So when γ is small there is little change to the natural frequency, but the amount of change grows quadratically as γ increases.

Critical damping

Critical damping occurs when γ2 = 4mk. In this case the solution is given by

u(t) = (A + Bt) exp( – γt/2m).

The position of the mass will asymptotically approach 0. Depending on the initial velocity, it will either go monotonically to zero or reach some maximum displacement before it turns around and goes to 0.

For an example, let m = 1 and k = 5 as before. For critical damping, γ must equal √20. Choose initial conditions so that A = 1 and B = 10.

Over-damping

When γ2 > 4mk the system is over-damped. When damping is that large, the system does not oscillate at all.

The solution is

u(t) = A exp(r1 t) + B exp(r2 t)

where r1 and r2 are the two roots of

m x2 + γ x + k = 0.

Because γ > 0, both roots are negative and solutions decay exponentially to 0.

Coming up

The next two posts in the series will add a forcing term, i.e. F > 0 in our differential equation. The next post will look at undamped driven vibrations, and the final post will look at damped driven vibrations.

Python animation for mechanical vibrations

Stéfan van der Walt wrote some Python code to animate the system described in yesterday’s post on mechanical vibrations.

Stéfan posted his code on github. It currently illustrates undamped free vibrations, but could be modified to work with damped or driven vibrations.

The code includes an option to save the output as an mp4 file. You must have ffmpeg installed to save a matplotlib animation as mp4.

Fourier series before Fourier

I always thought that Fourier was the first to come up with the idea of expressing general functions as infinite sums of sines and cosines. Apparently this isn’t true.

The idea that various functions can be described in terms of Fourier series … was for the first time proposed by Daniel Bernoulli (1700–1782) to solve the one-dimensional wave equation (the equation of motion of a string) about 50 years before Fourier. … However, no one contemporaneous to D. Bernoulli accepted the idea as a general method, and soon the study was forgotten.

Source: The Nonlinear World

Perhaps Fourier’s name stuck to the idea because he developed it further than Bernoulli did.

Related posts

History of weather prediction

I’ve just started reading Invisible in the Storm: The Role of Mathematics in Understanding Weather, ISBN 0691152721.

The subtitle may be a little misleading. There is a fair amount of math in the book, but the ratio of history to math is pretty high. You might say the book is more about the role of mathematicians than the role of mathematics. As Roger Penrose says on the back cover, the book has “illuminating descriptions and minimal technicality.”

Someone interested in weather prediction but without a strong math background would enjoy reading the book, though someone who knows more math will recognize some familiar names and theorems and will better appreciate how they fit into the narrative.

Related posts

Mechanical vibrations

vibration damper

My favorite topic in an introductory differential equations course is mechanical and electrical vibrations. I enjoyed learning about it as a student and I enjoyed teaching it later. (Or more accurately, I enjoyed being exposed to it as a student and really learning it later when I had to teach it.)

I find this subject interesting for three reasons.

  1. The same equations describe a variety of mechanical and electrical systems.
  2. You can get practical use out of some relatively simple math.
  3. The solutions display wide variety of behavior as you vary the coefficients.

This is the first of a four-part series of posts on mechanical vibrations. The posts won’t be consecutive: I’ll write about other things in between.

Simple mechanical vibrations satisfy the following differential equation:

m u'' + gamma u' + k u = F cos omega t

We could simply write down the general solution be done with it. But the focus here won’t be finding the solutions but rather understanding how the solutions behave.

We’ll think of our equation as modeling a system with a mass attached to a spring and a dash pot. All coefficients are constant. m is the mass, γ is the damping from the dash pot, and k is the restoring force from the spring. The driving force has amplitude F and frequency ω. The solution u(t) gives the position of the mass at time t.

More complicated vibrations, such as a tall building swaying in the wind, can be approximated by this simple setting.

The same differential equation could model an electrical circuit with an inductor, resistor, and capacitor. In that case replace mass m with the inductance L, damping γ with resistance R, and spring constant k with the reciprocal of capacitance C. Then the equation gives the charge on the capacitor at time t.

We will assume m and k are positive. The four blog posts will correspond to γ zero or positive, and F zero or non-zero. Since γ represents damping, the system is called undamped when γ = 0 and damped when γ is greater than 0. And since F is the amplitude of the forcing function, the system is called free when F = 0 and forced otherwise. So the plan for the four posts is

Free, undamped vibrations

With no damping and no forcing, our equation is simply

m u'' + k u = 0

and we can write down the solution

u(t) = A sin ω0t + B cos ω0t

where

ω02 = k/m.

The value ω0 is called the natural frequency of the system because it gives the frequency of vibration when there is no forcing function.

The values of A and B are determined by the initial conditions, i.e. u(0)  = B and u(0) = A ω0.

Since the sine and cosine components have the same frequency ω0, we can use a trig identity to combine them into a single function

u(t) = R cos(ω0t – φ)

The amplitude R and phase φ are related to the parameters A and B by

A = R cos φ, B = R sin φ.

That’s it for undamped free vibrations: the solutions are just sine waves. The next post in the series will make things more realistic and more interesting by adding damping.

Update: Here’s an animation of undamped free oscillation using code by Stéfan van der Walt described here.

Consulting in differential equations

Unicode to LaTeX

I’ve run across a couple websites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.

screenshot of www.johndcook.com/unicode_latex.png

For example, the screenshot above shows what happens if you enter U+221E and click “convert.” You could also enter infty and get back U+221E.

However, if you go from Unicode to LaTeX to Unicode, you won’t always end up where you started. There may be multiple Unicode values that map to a single LaTeX symbol. This is because Unicode is semantic and LaTeX is not. For example, Unicode distinguishes between the Greek letter Ω and the symbol Ω for ohms, the unit of electrical resistance, but LaTeX does not.

Letters that fell out of the alphabet

Mental Floss had an interesting article called 12 letters that didn’t make the alphabet. A more accurate title might be 12 letters that fell out of the modern English alphabet.

I thought it would have been better if the article had included the Unicode values of the letters, so I did a little research and created the following table.

Name Capital Small
Thorn U+00DE U+00FE
Wynn U+01F7 U+01BF
Yogh U+021C U+021D
Ash U+00C6 U+00E6
Eth U+00D0 U+00F0
Ampersand U+0026
Insular g U+A77D U+1D79
Thorn with stroke U+A764 U+A765
Ethel U+0152 U+0153
Tironian ond U+204A
Long s U+017F
Eng U+014A U+014B

 

Once you know the Unicode code point for a symbol, you can find out more about it, for example, here.