<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Can regular expressions parse HTML or not?</title>
	<atom:link href="http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/</link>
	<description>John D. Cook</description>
	<lastBuildDate>Thu, 23 May 2013 10:27:37 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>By: Aaron</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-780</link>
		<dc:creator>Aaron</dc:creator>
		<pubDate>Mon, 25 Feb 2013 18:41:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-780</guid>
		<description><![CDATA[To expand on heltonbiker&#039;s comment, I&#039;ve answered many questions from people who say they want to parse HTML but really mean they want to search a text document that happens to contain HTML. If you&#039;re not concerned with the structure, you&#039;re not parsing it. Pulling all of the URLs or email addresses from a web page can be done with regular expressions.]]></description>
		<content:encoded><![CDATA[<p>To expand on heltonbiker&#8217;s comment, I&#8217;ve answered many questions from people who say they want to parse HTML but really mean they want to search a text document that happens to contain HTML. If you&#8217;re not concerned with the structure, you&#8217;re not parsing it. Pulling all of the URLs or email addresses from a web page can be done with regular expressions.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pseudonym</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-779</link>
		<dc:creator>Pseudonym</dc:creator>
		<pubDate>Mon, 25 Feb 2013 01:16:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-779</guid>
		<description><![CDATA[Oh, and a^n b^n is:

&lt;0&#124; (a &lt;1&#124;)* (&#124;1&gt; b)* &#124;0&gt;]]></description>
		<content:encoded><![CDATA[<p>Oh, and a^n b^n is:</p>
<p>&lt;0| (a &lt;1|)* (|1&gt; b)* |0&gt;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pseudonym</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-778</link>
		<dc:creator>Pseudonym</dc:creator>
		<pubDate>Mon, 25 Feb 2013 01:15:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-778</guid>
		<description><![CDATA[Erm... looks like my stack notation ran foul of HTML. Here are those axioms again.

Terminals commute with stack operations:

a &lt;n&#124; = &lt;n&#124; a
a &#124;n&gt; = &#124;n&gt; a

Stack operations are orthonormal...

&lt;n&#124; &#124;n&gt; = 1
&lt;m&#124; &#124;n&gt; = 0, if m != n

...and a complete basis:

&#124;0&gt; &lt;0&#124; + &#124;1&gt; &lt;1&#124; + ... + &#124;N&gt; &lt;N&#124; = 1]]></description>
		<content:encoded><![CDATA[<p>Erm&#8230; looks like my stack notation ran foul of HTML. Here are those axioms again.</p>
<p>Terminals commute with stack operations:</p>
<p>a &lt;n| = &lt;n| a<br />
a |n&gt; = |n&gt; a</p>
<p>Stack operations are orthonormal&#8230;</p>
<p>&lt;n| |n&gt; = 1<br />
&lt;m| |n&gt; = 0, if m != n</p>
<p>&#8230;and a complete basis:</p>
<p>|0&gt; &lt;0| + |1&gt; &lt;1| + &#8230; + |N&gt; &lt;N| = 1</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pseudonym</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-777</link>
		<dc:creator>Pseudonym</dc:creator>
		<pubDate>Mon, 25 Feb 2013 01:11:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-777</guid>
		<description><![CDATA[There is also a nice formalism for extending regular expressions to context-free languages. Context-free languages can be recognised by push-down automata, which are basically DFAs or NFAs with stack operations. Why not just put the stack operations in the language?

In what follows, we will denote the empty set as 0, the empty string as 1, and set union as +. This is justified because it means that regular expressions are an idempotent semi-ring (idempotent because A+A=A), plus the Kleene closure.

We assume that there are N+1 stack symbols 0..N, where 0 represents a sentinel symbol at the base of the stack. (We don&#039;t strictly need symbol 0, but it makes things a little easier to describe.) Then we can represent a push of symbol m by &lt;m&#124; and a pop by &#124;m&gt;. The reason for this notation will become clear in a moment.

So, for example, we can recognise a^n b^n with the regular expression:

&lt;0&#124; (a &lt;1&#124;)* (&#124;1&gt; b)* &#124;0&gt;

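The push-down idea behind recognising a^n b^n can be sketched in Python. This is only a rough illustration with an explicit stack, not an implementation of the algebra itself; the function name recognise_anbn and the use of the integers 0 and 1 as stack symbols are assumptions made for the example.

```python
def recognise_anbn(text):
    """Return True if text is a^n b^n for some n, using an explicit stack."""
    stack = [0]                  # symbol 0 is the sentinel at the base of the stack
    seen_b = False               # once a b has been read, no further a is allowed
    for ch in text:
        if ch == "a" and not seen_b:
            stack.append(1)      # push symbol 1 for every a
        elif ch == "b" and stack[-1] == 1:
            seen_b = True
            stack.pop()          # pop symbol 1 for every b
        else:
            return False         # wrong character, wrong order, or too many b
    return stack == [0]          # accept only if just the sentinel remains
```

Calling recognise_anbn on "aabb" pushes twice and pops twice, leaving only the sentinel, so the string is accepted; "aab" leaves an unmatched symbol on the stack and is rejected.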
We need some additional axioms. First, terminal symbols commute with stack operations:

a &lt;n&#124; = &lt;n&#124; a
a &#124;n&gt; = &#124;n&gt; a

Finally, we describe what happens when pushes meet pops:

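&lt;n&#124; &#124;n&gt; = 1
&lt;m&#124; &#124;n&gt; = 0, if m != n
&#124;0&gt; &lt;0&#124; + &#124;1&gt; &lt;1&#124; + ... + &#124;N&gt; &lt;N&#124; = 1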

So the stack symbols are like orthonormal basis vectors, where &lt;m&#124; &#124;n&gt; is the inner product (&#124;n&gt; is a vector, and &lt;n&#124; is its dual vector/one-form). The final axiom states that the set of basis vectors is complete. The fact that terminals commute with stack symbols means that strings of terminals are the &quot;scalars&quot; of the vector space.

The axioms of context-free expressions are, in summary, very similar to those of a spinor algebra.

The neat thing about this is that it generalises in an obvious way. Add a second stack (or a richer set of stack state symbols with algebra to match), and you have &quot;Turing expressions&quot;. Add the possibilities for inner products to return values other than 0 or 1, and you have quantum computing.]]></description>
		<content:encoded><![CDATA[<p>There is also a nice formalism for extending regular expressions to context-free languages. Context-free languages can be recognised by push-down automata, which are basically DFAs or NFAs with stack operations. Why not just put the stack operations in the language?</p>
<p>In what follows, we will denote the empty set as 0, the empty string as 1, and set union as +. This is justified because it means that regular expressions are an idempotent semi-ring (idempotent because A+A=A), plus the Kleene closure.</p>
<p>We assume that there are N+1 stack symbols 0..N, where 0 represents a sentinel symbol at the base of the stack. (We don&#8217;t strictly need symbol 0, but it makes things a little easier to describe.) Then we can represent a push of symbol m by &lt;m| and a pop by |m&gt;. The reason for this notation will become clear in a moment.</p>
<p>So, for example, we can recognise a^n b^n with the regular expression:</p>
<p>&lt;0| (a &lt;1|)* (|1&gt; b)* |0&gt;</p>
<p>We need some additional axioms. First, terminal symbols commute with stack operations:</p>
<p>a &lt;n| = &lt;n| a<br />
a |n&gt; = |n&gt; a</p>
<p>Finally, we describe what happens when pushes meet pops:</p>
<p>&lt;n| |n&gt; = 1<br />
&lt;m| |n&gt; = 0, if m != n<br />
|0&gt; &lt;0| + |1&gt; &lt;1| + &#8230; + |N&gt; &lt;N| = 1</p>
<p>So the stack symbols are like orthonormal basis vectors, where &lt;m| |n&gt; is the inner product (|n&gt; is a vector, and &lt;n| is its dual vector/one-form). The final axiom states that the set of basis vectors is complete. The fact that terminals commute with stack symbols means that strings of terminals are the &quot;scalars&quot; of the vector space.</p>
<p>The axioms of context-free expressions are, in summary, very similar to those of a spinor algebra.</p>
<p>The neat thing about this is that it generalises in an obvious way. Add a second stack (or a richer set of stack state symbols with algebra to match), and you have &quot;Turing expressions&quot;. Add the possibilities for inner products to return values other than 0 or 1, and you have quantum computing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-776</link>
		<dc:creator>John</dc:creator>
		<pubDate>Sat, 23 Feb 2013 12:54:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-776</guid>
		<description><![CDATA[Babluki: Regular expressions (in the original sense) match regular languages. Context-free languages are more general, higher up the Chomsky hierarchy, and so cannot be described by (classical) regular expressions. But regular expressions in the contemporary sense can match context-free languages.]]></description>
		<content:encoded><![CDATA[<p>Babluki: Regular expressions (in the original sense) match regular languages. Context-free languages are more general, higher up the Chomsky hierarchy, and so cannot be described by (classical) regular expressions. But regular expressions in the contemporary sense can match context-free languages.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Babluki</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-775</link>
		<dc:creator>Babluki</dc:creator>
		<pubDate>Sat, 23 Feb 2013 10:56:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-775</guid>
		<description><![CDATA[You quoted:
&quot;Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.&quot;
As far as I know, according to computer science theory, there is a difference between a context-free language (which can be generated by a context-free grammar) and a regular language (which can be matched using a regular expression).
In other words, you can&#039;t match an arbitrary context-free language with a regular expression, right?]]></description>
		<content:encoded><![CDATA[<p>You quoted:<br />
&#8220;Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.&#8221;<br />
As far as I know, according to computer science theory, there is a difference between a context-free language (which can be generated by a context-free grammar) and a regular language (which can be matched using a regular expression).<br />
In other words, you can&#8217;t match an arbitrary context-free language with a regular expression, right?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: grefel</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-774</link>
		<dc:creator>grefel</dc:creator>
		<pubDate>Fri, 22 Feb 2013 11:43:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-774</guid>
		<description><![CDATA[I had two projects involving analysing all sorts of crappy HTML with regex. It works to some extent, but after a while you profoundly wish for another tool to solve the issue.
The problem here: XPath only works with well-formed documents, and working on top of a browser parser is not that quick and simple.
The real problem is the mess you get out of the net :-)]]></description>
		<content:encoded><![CDATA[<p>I had two projects involving analysing all sorts of crappy HTML with regex. It works to some extent, but after a while you profoundly wish for another tool to solve the issue.<br />
The problem here: XPath only works with well-formed documents, and working on top of a browser parser is not that quick and simple.<br />
The real problem is the mess you get out of the net <img src='http://www.johndcook.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kl</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-773</link>
		<dc:creator>kl</dc:creator>
		<pubDate>Thu, 21 Feb 2013 17:20:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-773</guid>
		<description><![CDATA[Do you really mean well-formed HTML, or rather XHTML?

Because well-formed HTML is still allowed, for example, to omit the start tag for body but keep its closing tag, and vice versa.

I think you could certainly tokenize it, but I can&#039;t imagine handling all the messy stack manipulation that is allowed even in valid HTML Strict.]]></description>
		<content:encoded><![CDATA[<p>Do you really mean well-formed HTML, or rather XHTML?</p>
<p>Because well-formed HTML is still allowed, for example, to omit the start tag for body but keep its closing tag, and vice versa.</p>
<p>I think you could certainly tokenize it, but I can&#8217;t imagine handling all the messy stack manipulation that is allowed even in valid HTML Strict.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: LJ</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-772</link>
		<dc:creator>LJ</dc:creator>
		<pubDate>Thu, 21 Feb 2013 16:45:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-772</guid>
		<description><![CDATA[The problem is not whether regexes can parse HTML; the problem is whether it&#039;s a good idea to keep writing irregular expressions for HTML in toolkits that can easily &lt;a href=&quot;http://swtch.com/~rsc/regexp/regexp1.html&quot; rel=&quot;nofollow&quot;&gt;go exponential&lt;/a&gt; when used carelessly. There are already many good tools for this task; even ones that repair broken HTML while retaining (a guess at) its intended structure (tidy, LXML, BeautifulSoup, to name a few). The probability of those failing is still non-zero, but smaller than that of &lt;code&gt;htmlparsehack.pl&lt;/code&gt; failing.]]></description>
		<content:encoded><![CDATA[<p>The problem is not whether regexes can parse HTML; the problem is whether it&#8217;s a good idea to keep writing irregular expressions for HTML in toolkits that can easily <a href="http://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">go exponential</a> when used carelessly. There are already many good tools for this task; even ones that repair broken HTML while retaining (a guess at) its intended structure (tidy, LXML, BeautifulSoup, to name a few). The probability of those failing is still non-zero, but smaller than that of <code>htmlparsehack.pl</code> failing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Derek Jones</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-771</link>
		<dc:creator>Derek Jones</dc:creator>
		<pubDate>Thu, 21 Feb 2013 15:50:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-771</guid>
		<description><![CDATA[To be exact, HTML is specified by a grammar that can be written in a context-free form (the syntax specification of some languages is written in a form that is not context-free and some reorganization is needed to make it so, or not {as is the case with C++}). This does not mean that it can be parsed using tools such as Bison, since they require an &lt;a href=&quot;https://en.wikipedia.org/wiki/LALR_parser&quot; rel=&quot;nofollow&quot;&gt;LALR(1)&lt;/a&gt; grammar (a subset of context-free).

Can PCRE really parse a larger set of grammars than Bison or is somebody just overgeneralizing the fact that PCRE can handle languages that are a superset of regular expressions?

Life is made much simpler by pointing people at the &lt;a href=&quot;https://en.wikipedia.org/wiki/Chomsky_hierarchy&quot; rel=&quot;nofollow&quot;&gt;Chomsky hierarchy&lt;/a&gt;]]></description>
		<content:encoded><![CDATA[<p>To be exact, HTML is specified by a grammar that can be written in a context-free form (the syntax specification of some languages is written in a form that is not context-free and some reorganization is needed to make it so, or not {as is the case with C++}). This does not mean that it can be parsed using tools such as Bison, since they require an <a href="https://en.wikipedia.org/wiki/LALR_parser" rel="nofollow">LALR(1)</a> grammar (a subset of context-free).</p>
<p>Can PCRE really parse a larger set of grammars than Bison or is somebody just overgeneralizing the fact that PCRE can handle languages that are a superset of regular expressions?</p>
<p>Life is made much simpler by pointing people at the <a href="https://en.wikipedia.org/wiki/Chomsky_hierarchy" rel="nofollow">Chomsky hierarchy</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dan</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-770</link>
		<dc:creator>dan</dc:creator>
		<pubDate>Thu, 21 Feb 2013 15:45:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-770</guid>
		<description><![CDATA[The top answer here - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - might be useful.

[Same link as in first comment. -- JC] ]]></description>
		<content:encoded><![CDATA[<p>The top answer here &#8211; <a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags" rel="nofollow">http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags</a> &#8211; might be useful.</p>
<p>[Same link as in first comment. -- JC] </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-769</link>
		<dc:creator>John</dc:creator>
		<pubDate>Thu, 21 Feb 2013 15:24:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-769</guid>
		<description><![CDATA[heltonbiker: Agreed. I&#039;m using &quot;parsing&quot; loosely here.]]></description>
		<content:encoded><![CDATA[<p>heltonbiker: Agreed. I&#8217;m using &#8220;parsing&#8221; loosely here.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: heltonbiker</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-768</link>
		<dc:creator>heltonbiker</dc:creator>
		<pubDate>Thu, 21 Feb 2013 15:12:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-768</guid>
		<description><![CDATA[A fundamental distinction must always be made between really PARSING html with regex versus (usually one-off, quick and dirty) SEARCHING or MATCHING some html with regex.]]></description>
		<content:encoded><![CDATA[<p>A fundamental distinction must always be made between really PARSING html with regex versus (usually one-off, quick and dirty) SEARCHING or MATCHING some html with regex.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ross Patterson</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-767</link>
		<dc:creator>Ross Patterson</dc:creator>
		<pubDate>Thu, 21 Feb 2013 14:53:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-767</guid>
		<description><![CDATA[Perhaps we should start referring to abnormalities like PCRE as &quot;irregular expression engines&quot;.  :-)]]></description>
		<content:encoded><![CDATA[<p>Perhaps we should start referring to abnormalities like PCRE as &#8220;irregular expression engines&#8221;.  <img src='http://www.johndcook.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-766</link>
		<dc:creator>John</dc:creator>
		<pubDate>Thu, 21 Feb 2013 14:37:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-766</guid>
		<description><![CDATA[Srinath: True, but PCRE can parse wcw.]]></description>
		<content:encoded><![CDATA[<p>Srinath: True, but PCRE can parse wcw.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Srinath</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-765</link>
		<dc:creator>Srinath</dc:creator>
		<pubDate>Thu, 21 Feb 2013 14:30:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-765</guid>
		<description><![CDATA[Interesting post. But HTML is context-free only if we hard-code context-free expressions for all valid HTML tags. On the other hand, if we wish to parse XML (even well-formed XML), where tags can be arbitrary, even a context-free parser won&#039;t work. A well-known result in theoretical computer science says that the language of strings of the form wcw, where w is an arbitrary string (not a single character), is not context-free.]]></description>
		<content:encoded><![CDATA[<p>Interesting post. But HTML is context-free only if we hard-code context-free expressions for all valid HTML tags. On the other hand, if we wish to parse XML (even well-formed XML), where tags can be arbitrary, even a context-free parser won&#8217;t work. A well-known result in theoretical computer science says that the language of strings of the form wcw, where w is an arbitrary string (not a single character), is not context-free.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Samuel Jack</title>
		<link>http://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/comment-page-1/#comment-764</link>
		<dc:creator>Samuel Jack</dc:creator>
		<pubDate>Thu, 21 Feb 2013 14:00:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/?p=12983#comment-764</guid>
		<description><![CDATA[No discussion about parsing HTML with Regular Expressions is complete without reference to &lt;a href=&quot;http://stackoverflow.com/a/1732454/1727&quot; rel=&quot;nofollow&quot;&gt;this&lt;/a&gt;]]></description>
		<content:encoded><![CDATA[<p>No discussion about parsing HTML with Regular Expressions is complete without reference to <a href="http://stackoverflow.com/a/1732454/1727" rel="nofollow">this</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>
