Can regular expressions parse HTML or not?

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post [Update: link went away] by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

18 thoughts on “Can regular expressions parse HTML or not?

  1. Interesting post.. But HTML is context free only if we hard code context-free expressions for all valid HTML tags. On the other hand, if we wish to parse XML (even well-formed ones) where tags can be arbitrary, even a context-free parser won’t work. A well known result in theoretical computer science says that expressions of the form wcw where w is a string (not a character) is not in CFL..

  2. Perhaps we should start referring to abnormalities like PCRE as “irregular expression engines”. :-)

  3. A fundamental distinction must always be made between really PARSING html with regex versus (usually one-off, quick and dirty) SEARCHING or MATCHING some html with regex.

  4. To be exact HTML is specified by a grammar that can be written in a context free form (the syntax specification of some languages is written in a form that is not context free and some reorganization is needed to make it so, or not {as is the case with C++}). This does not mean that it can be parsed using tools such as Bison since they require a LALR(1) grammar (a subset of context free).

    Can PCRE really parse a larger set of grammars than Bison or is somebody just overgeneralizing the fact that PCRE can handle languages that are a superset of regular expressions?

    Life is made much simpler by pointing people at the Chomsky hierarchy

  5. The problem is not whether regexes can parse HTML; the problem is whether it’s a good idea to keep writing irregular expressions for HTML in toolkits that can easily go exponential when used carelessly. There are already many good tools for this task; even ones that repair broken HTML while retaining (a guess at) its intended structure (tidy, LXML, BeautifulSoup, to name a few). The probability of those failing is still non-zero, but smaller than that of htmlparsehack.pl failing.

  6. Do you really mean well-formed HTML, or rather XHTML?

    Because well-formed HTML is e.g. still allowed to omit start-tag for body, but have closing tag for it and vice-versa.

    I think you could certainly tokenize it, but I can’t imagine handling all the messy stack manipulation that is allowed even in valid HTML Strict.

  7. I had two projects involving analysing all sorts auf crappy HTML with RegEx. It works to some extend, but after a while you start to wish profoundly another Tool to solve the issue.
    Problem here: XPath ist only working with wellformed Documents working on top of a browser parser is not that quick und and simple.
    The real problem is the mess you get out of the net :-)

  8. You quoted:
    “Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.”
    As far as I know according to computer science theory, there is a difference between context-free language (which can be generated by context-free grammar) and a regular language (which can be matched using regular expression).
    In other words, you can’t match any context-free language with regular expression, right?

  9. Babluki: Regular expressions (in the original sense) match regular languages. Context-free languages are more general, higher up the Chomsky hierarchy, and so cannot be described by (classical) regular expressions. But regular expressions in the contemporary sense can match context-free languages.

  10. There is also a nice formalism for extending regular expressions to context-free languages. Context-free languages can be recognised by push-down automata, which are basically DFAs or NFAs with stack operations. Why not just put the stack operations in the language?

    In what follows, we will denote the empty set as 0, the empty string as 1, and set union as +. This is justified because it means that regular expressions are an idempotent semi-ring (idempotent because A+A=A), plus the Kleene closure.

    We assume that there are N+1 stack symbols 0..N, where 0 represents a sentinel symbol at the base of the stack. (We don’t strictly need symbol 0, but it makes things a little easier to describe.) Then we can represent a push of symbol m by . The reason for this notation will become clear in a moment.

    So, for example, we can recognise a^n b^n with the regular expression:

    <0| (a )* |0>

    We need some additional axioms. First, terminal symbols commute with stack operations:

    a <n| = = |n> a

    Finally, we describe what happens when pushes meet pops:

    = 1
    = 0, if m != n
    |0> <N| = 1

    So the stack symbols are like orthonormal basis vectors with is the inner product (|n> is a vector, and <n| is its dual vector/one-form). The final axiom states that the set of basis vectors is complete. The fact that terminals commute with stack symbols mean that strings of terminals are the "scalars" of the vector field.

    The axioms of context-free expressions are, in summary, very similar to those of a spinor algebra.

    The neat thing about this is that it generalises in an obvious way. Add a second stack (or a richer set of stack state symbols with algebra to match), and you have "Turing expressions". Add the possibilities for inner products to return values other than 0 or 1, and you have quantum computing.

  11. Erm… looks like my stack notation ran foul of HTML. Here are those axioms again.

    Terminals commute with stack operations:

    a <n| = <n| a
    a |n> = |n> a

    Stack operations are orthonormal…

    <n| |n> = 1
    <m| |n> = 0

    …and a complete basis:

    |0> <0| + |1> <1| + … + |N> <N| = 1

  12. To expand on heltonbiker’s comment, I’ve answered many questions from people who say they want to parse HTML really mean they want to search a text document that happens to contain HTML. If you’re not concerned with the structure, you’re not parsing it. Pulling all of the URLs or email addresses from a web page can be done with regular expressions.

  13. We hired a consultant to do our DITA parsing. When the project manager said he was using ReEdit, well, that didn’t end well. He should have used XSL.

Comments are closed.