<0| (a <1|)* (|1> b)* |0>

]]>Terminals commute with stack operations:

a <n| = <n| a

a |n> = |n> a

Stack operations are orthonormal…

<n| |n> = 1

<m| |n> = 0, if m != n

…and a complete basis:

|0> <0| + |1> <1| + … + |N> <N| = 1

]]>In what follows, we will denote the empty set as 0, the empty string as 1, and set union as +. This is justified because, with these conventions, regular expressions form an idempotent semiring (idempotent because A+A=A), extended with the Kleene closure.
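As a quick sanity check of those conventions, here is a minimal Python sketch that models languages as finite sets of strings. The names ZERO, ONE and concat are my own, purely for illustration, and finite sets cannot represent the Kleene closure, so that part is left out.

```python
# Languages modelled as finite sets of strings (illustrative assumption).
ZERO = frozenset()        # the empty set, written 0
ONE = frozenset({""})     # the set holding only the empty string, written 1


def concat(A, B):
    """Product of two languages: every string of A followed by every string of B."""
    return frozenset(x + y for x in A for y in B)


A = frozenset({"a", "ab"})

# "+" is set union, so it is idempotent and 0 is its identity.
assert A | A == A
assert A | ZERO == A

# 1 is the identity for the product, and 0 annihilates it.
assert concat(A, ONE) == A
assert concat(ONE, A) == A
assert concat(A, ZERO) == ZERO
```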

We assume that there are N+1 stack symbols 0..N, where 0 represents a sentinel symbol at the base of the stack. (We don’t strictly need symbol 0, but it makes things a little easier to describe.) Then we can represent a push of symbol m by <m| and a pop of symbol m by |m>. The reason for this notation will become clear in a moment.

So, for example, we can recognise a^n b^n with the regular expression:

<0| (a <1|)* (|1> b)* |0>

We need some additional axioms. First, terminal symbols commute with stack operations:

a <n| = <n| a

a |n> = |n> a

Finally, we describe what happens when pushes meet pops:

<n| |n> = 1

<m| |n> = 0, if m != n

|0> <0| + |1> <1| + … + |N> <N| = 1

So the stack symbols are like orthonormal basis vectors, where <m| |n> is the inner product (|n> is a vector, and <n| is its dual vector/one-form). The final axiom states that the set of basis vectors is complete. The fact that terminals commute with stack symbols means that strings of terminals are the "scalars" of the vector space.
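To make the axioms concrete, here is a rough Python sketch that reduces a single word (one expansion of the starred expression, not the expression itself). The token encoding, with ("push", m) standing for <m| and ("pop", m) standing for |m>, and the helper names reduce_word and word are my own assumptions for illustration. Terminals are collected to one side because they commute with stack operations, and a push immediately followed by a pop is replaced by 1 or 0 according to the orthonormality axioms.

```python
def reduce_word(tokens):
    """Reduce a word of terminals and stack operations.

    Returns None if the word reduces to 0, otherwise a pair
    (terminal string, leftover stack operations). An empty leftover
    list means the stack operations reduced to 1.
    """
    terminals = []
    residue = []  # stack operations that have not cancelled yet
    for tok in tokens:
        if isinstance(tok, str):
            # Terminals commute with stack operations, so pull them out.
            terminals.append(tok)
            continue
        kind, sym = tok
        if kind == "pop" and residue and residue[-1][0] == "push":
            if residue[-1][1] != sym:
                return None   # <m| |n> = 0, if m != n
            residue.pop()     # <n| |n> = 1
        else:
            residue.append(tok)
    return "".join(terminals), residue


def word(n_a, n_b):
    """One expansion of <0| (a <1|)* (|1> b)* |0> with n_a a's and n_b b's."""
    return ([("push", 0)]
            + ["a", ("push", 1)] * n_a
            + [("pop", 1), "b"] * n_b
            + [("pop", 0)])


print(reduce_word(word(2, 2)))  # ('aabb', [])  stack operations reduce to 1
print(reduce_word(word(2, 1)))  # None          the word reduces to 0
```

Only the balanced words survive the reduction, which is why the expression recognises a^n b^n.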

The axioms of context-free expressions are, in summary, very similar to those of a spinor algebra.

The neat thing about this is that it generalises in an obvious way. Add a second stack (or a richer set of stack state symbols with algebra to match), and you have "Turing expressions". Add the possibility for inner products to return values other than 0 or 1, and you have quantum computing.

]]>“Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.”

As far as I know, computer science theory distinguishes between a context-free language (which can be generated by a context-free grammar) and a regular language (which can be matched by a regular expression).

In other words, you can’t match an arbitrary context-free language with a regular expression, right? ]]>

The problem here: XPath only works with well-formed documents, and working on top of a browser parser is not that quick and simple.

The real problem is the mess you get from the net. ]]>

Because well-formed HTML is, for example, still allowed to omit the start tag for body while keeping its closing tag, and vice versa.

I think you could certainly tokenize it, but I can’t imagine handling all the messy stack manipulation that is allowed even in valid HTML Strict.

]]>`htmlparsehack.pl`

failing.
]]>Can PCRE really parse a larger set of grammars than Bison or is somebody just overgeneralizing the fact that PCRE can handle languages that are a superset of regular expressions?

Life is made much simpler by pointing people at the Chomsky hierarchy.

]]>[Same link as in first comment. — JC]

]]>