Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.
First, let’s look at theoretical answers.
When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”
But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.
Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.
So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.
Now on to the practical answers. The next lines in Nikita Popov’s post say
But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.
HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.
This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.
Related post: Coming full circle