Regular expressions in C++ TR1

Regular expressions are not part of the C++ Standard Library quite yet. However, Technical Report 1 (TR1) includes, among other things, a specification for regular expression support that will probably be added to the C++ standard eventually.

The Boost library has supported TR1 for a while. Microsoft just released a feature pack for Visual Studio 2008 a month ago that includes support for most of TR1. (They’ve left out support for mathematical special functions.) And Dinkumware sells a complete TR1 implementation.

I’ve added some notes to my website for getting started with C++ TR1 regular expressions. I took my PowerShell regex notes as a starting point and implemented some of the same examples in C++. I changed the organization though, because the C++ implementation is fairly different from PowerShell.

Working with regular expressions is harder in C++ than in scripting languages such as Perl or Python, but not unnecessarily so. C++ is optimized for fine-grained control and efficiency rather than ease of use; that’s what C++ is for. The TR1 implementation is internally consistent and elegant in its own way.

It’s easy to find API-level documentation but harder to find examples for getting started. (I’ve heard good things about Pete Becker’s book The C++ Standard Library Extensions but I haven’t read it.) So I decided to keep some notes as I played with the Visual Studio implementation. I imagine most of the content applies to other implementations, but I’ve only tested the examples using Visual Studio.

Update: GCC just added support for C++ TR1 two days ago with their version 4.3 release.  However, it appears support for regular expressions is not included.

LINQ to Regex

Roy Osherove just posted an article introducing his LINQ to Regex project.

LINQ stands for Language INtegrated Query, a way of baking query support into .NET programming languages. Microsoft has been promising a unified way to query all kinds of data for years now.  Along the way they came out with a score of new libraries that were going to be the solution. They’d work for all kinds of data that happened to look very much like a relational database. But now with LINQ they’ve finally delivered something that works well not only with relational data but also with hierarchical data such as XML. With LINQ to Regex, you can query unstructured text with LINQ as well.

There are two big advantages to LINQ. First, you can query different kinds of data sources with similar code. Second, “language integrated” means that your programming language knows about your query language, making strong typing and better tool support possible. (By contrast, if you have a SQL statement inside VB, for example, VB knows nothing about SQL. The SQL command is just a string as far as VB is concerned. If the SQL is malformed, you won’t know until runtime. But with LINQ, malformed queries generate compile errors.)

Update: See Scott Hanselman’s discussion of LINQ to Regex.

Readable path listings

Windows has never made it easy to read long environment variables. If I display the path on one machine I get something like this, both from cmd and from PowerShell.

C:\bin;C:\bin\Python25;C:\bin\TeX\miktex\bin;C:\bin\TeX\MiKTeX\miktex\bin;C:\bin\Perl\bin;C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin; ...

The System Properties window is worse since you can only see a tiny slice of your path at a time.

[Screenshot: the path editing UI in the System Properties window]

Here’s a PowerShell one-liner to produce a readable path listing:

$env:path -replace ";", "`n"

This produces

C:\bin
C:\bin\Python25
C:\bin\TeX\miktex\bin
C:\bin\TeX\MiKTeX\miktex\bin
C:\bin\Perl\bin
C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin
...

(If you’re not familiar with PowerShell, note the backquote before the n, which denotes the newline character that replaces each semicolon. This is one of the most unconventional features of PowerShell, since backslash is the escape character in most contexts. Because Windows uses backslashes as path separators, PowerShell could not use backslash as an escape character. Think of the backquote as a little backslash. Once you get over the initial shock, you get used to the backquote quickly.)

Update: It occurred to me after the original post that there’s an even simpler way to display the path.

$env:path.split(';')

Tips for learning regular expressions

Here are a few realizations that helped me the most when I was learning regular expressions.

1. Regular expressions aren’t trivial. If you think they’re trivial, but you can’t get them to work, then you feel stupid. They’re not trivial, but they’re not that hard either. They just take some study.

2. Regular expressions are not command line wild cards. They contain some of the same symbols but they don’t mean the same thing. They’re just similar enough to cause confusion.

3. Regular expressions are a little programming language. Regular expressions are usually contained inside another programming language, like JavaScript or PowerShell. Think of the expressions as little bits of a foreign language, like a French quotation inside English prose. Don’t expect rules from the outside language to have any relation to the rules inside, no more than you’d expect English grammar to apply inside that French quote.

4. Character classes are a little sub-language within regular expressions. Character classes are their own little world. Once you realize that and don’t expect the usual rules for regular expressions outside character classes to apply, you can see that they’re not very complicated, just different. Failure to realize that they are different is a major source of bugs.

Once you’re ready to dive into regular expressions, read Jeffrey Friedl’s book Mastering Regular Expressions (ISBN 0596528124). It’s by far the best book on the subject. Read the first few chapters carefully, but then flip the pages quickly when he goes off into NFA engines and all that.

***

For daily tips on regular expressions, follow @RegexTip on Twitter.