C++ TR1 regular expressions

Overview
Header and namespace
C++ regular expression flavor
Matching
Retrieving matches
Replacing matches
Escape sequences
Case-sensitivity
Troubleshooting

Overview

This article is written for the benefit of someone familiar with regular expressions but not with the use of regular expressions in C++ via the TR1 (C++ Standards Committee Technical Report 1) extensions. Comparisons will be made with Perl for those familiar with Perl, though no knowledge of Perl is required. The focus is not on the syntax of regular expressions per se but rather how to use regular expressions to search for patterns and make replacements.

Support for TR1 extensions in Visual Studio 2008 is added as a feature pack. It is also included in Visual Studio 2010. Other implementations include Boost and Dinkumware.

The C++ TR1 regular expression specification has an intimidating array of options. This article is intended to get you started, not to explore every nook and cranny. Getting started is the harder part since it’s easier to find API details than basic examples.

The examples below use fully qualified namespaces for clarity. You could make your code more succinct by adding a few using statements to eliminate namespace qualifiers.

C++ TR1 regular expression flavor

The C++ TR1 regular expressions can follow the syntax of several regular expression environments depending on the optional flags sent to the regular expression class constructor. The six options given in the Microsoft implementation are as follows.

  • basic
  • extended
  • ECMAScript
  • awk
  • grep
  • egrep

The default for the Microsoft implementation is ECMAScript, matching the regular expression syntax of the ECMAScript (JavaScript) language, which is very similar to that in Perl 5.

The choice of flavors is extensible and implementation-specific. For example, the Boost implementation adds perl as an option, which presumably follows Perl 5 syntax more closely than the ECMAScript option does.
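As a rough sketch of how a flavor flag is passed to the constructor, the helper below takes the grammar as a parameter. It assumes a C++11 compiler, where the TR1 names live in namespace std rather than std::tr1; the function name matches_with_flavor is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches `pattern`
// under the grammar selected by `flavor`
// (for example std::regex_constants::extended).
// Assumption: C++11 <regex>; with TR1 the names would be std::tr1::...
bool matches_with_flavor(const std::string& text,
                         const std::string& pattern,
                         std::regex_constants::syntax_option_type flavor)
{
    std::regex rx(pattern, flavor);   // the flag is the second constructor argument
    return std::regex_match(text, rx);
}
```

For example, the ECMAScript flavor accepts \d for a digit, while the POSIX extended flavor would typically spell the same thing [[:digit:]].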

For someone familiar with regular expressions the difficulty in using regular expressions in C++ TR1 is not in the syntax of regular expressions themselves, but rather in using regular expressions to do work.

Header and namespace

The C++ regular expression functions are defined in the <regex> header and contained in the namespace std::tr1. Note that tr is lowercase in C++. In English prose “TR” is capitalized.
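The following sketch shows the header together with a few using declarations that remove the namespace qualifiers. It assumes a C++11 compiler, where the same classes and functions are found in std instead of std::tr1; the helper has_digit is a hypothetical name used only for this example.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With TR1 these would read
// std::tr1::regex and std::tr1::regex_search.
using std::regex;
using std::regex_search;
using std::string;

// Hypothetical helper: true when `text` contains at least one digit.
bool has_digit(const string& text)
{
    regex rx("[0-9]");
    return regex_search(text, rx);
}
```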

Matching

The first surprise you may run into with the C++ regular expression implementation is that regex_match does not “match” in the usual sense. It will return true only when the entire string matches the regular expression. The function regex_search works more like the match operator in other environments, such as the m// operator in Perl.

To illustrate regex_match and regex_search, start with a C++ string

        
    std::string str = "Hello world";
        

and construct a regular expression

        
    std::tr1::regex rx("ello");
        

The expression

        
    regex_match(str.begin(), str.end(), rx)
        

will return false because the string str contains more characters beyond the match of the regular expression rx. However

        
    regex_search(str.begin(), str.end(), rx)
        

will return true because the regular expression matches a substring of str.
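The contrast between the two functions can be packaged as a pair of small helpers. This is a minimal sketch that assumes a C++11 compiler (namespace std in place of std::tr1); the names matches_entire_string and contains_match are invented for the example.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>; with TR1 the names would be std::tr1::...

// True only when the pattern accounts for every character of `text`.
bool matches_entire_string(const std::string& text, const std::string& pattern)
{
    std::regex rx(pattern);
    return std::regex_match(text.begin(), text.end(), rx);
}

// True when the pattern matches some substring of `text`,
// which is what Perl's m// does.
bool contains_match(const std::string& text, const std::string& pattern)
{
    std::regex rx(pattern);
    return std::regex_search(text.begin(), text.end(), rx);
}
```

With text "Hello world", the pattern "ello" fails under matches_entire_string but succeeds under contains_match. Padding the pattern to ".*ello.*" lets it cover the whole string, so regex_match then succeeds as well.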

Retrieving matches

After performing a match in Perl, the captured matches are stored in the variables $1, $2, etc. Similarly, after a match C++ places the captures in a match_results object. However, while Perl always creates the $n variables, C++ does not store matches unless you call an overloaded form of regex_search that takes a match_results object. The class match_results is a template; often people use the class cmatch defined by

        
    typedef match_results<const char*> cmatch;
        

The following example shows how to retrieve captured matches.

        
    std::tr1::cmatch res;
    str = "<h2>Egg prices</h2>";
    std::tr1::regex rx("<h(.)>([^<]+)");
    std::tr1::regex_search(str.c_str(), res, rx);
    std::cout << res[1] << ". " << res[2] << "\n";
        

The code above will output

        
    2. Egg prices
        

Note that res[n] corresponds to Perl’s $n.
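When the subject is a std::string rather than a C string, the analogous typedef is smatch, a match_results over string iterators. The sketch below wraps the heading example in a function and checks the return value of regex_search before reading the captures. It assumes a C++11 compiler (std in place of std::tr1), and parse_heading is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {heading level, heading text},
// or two empty strings when no heading is found.
// Assumption: C++11 <regex>; with TR1 the names would be std::tr1::...
std::pair<std::string, std::string> parse_heading(const std::string& html)
{
    std::regex rx("<h(.)>([^<]+)");
    std::smatch res;   // match_results<std::string::const_iterator>
    if (std::regex_search(html, res, rx)) {
        // res[0] is the whole match; res[1] and res[2] are the captures.
        return std::make_pair(res[1].str(), res[2].str());
    }
    return std::make_pair(std::string(), std::string());
}
```

Calling parse_heading("<h2>Egg prices</h2>") yields the pair "2" and "Egg prices", the same captures as in the cmatch example.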

Replacing matches

The following code will replace “world” in the string “Hello world” with “planet”. The string str2 will contain “Hello planet” and the string str will remain unchanged.

        
    std::string str = "Hello world";
    std::tr1::regex rx("world");
    std::string replacement = "planet";
    std::string str2 = std::tr1::regex_replace(str, rx, replacement);
        

Note that regex_replace does not change its arguments, unlike the Perl command s/world/planet/.

Note also that in the TR1 interface the third argument to regex_replace must be a std::string object and not a string literal. You could, however, eliminate the temporary variable replacement by passing a string literal converted to a std::string.

        
    regex_replace(str, rx, std::string("planet"))
        

By default, all instances of the pattern that match the regular expression are replaced. In the example above, if str had been "Hello world world" the result would have been "Hello planet planet". To replace only the first instance (to produce "Hello planet world") you would need to add the flag

        
    std::tr1::regex_constants::format_first_only
        

as the fourth argument to regex_replace.

Because the default behavior of regex_replace is a global replace, the function is analogous to the s///g operator in Perl. With the format_first_only flag the function is analogous to the unmodified s/// Perl operator.
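The two behaviors can be put side by side as follows. This is a sketch assuming a C++11 compiler (std in place of std::tr1); replace_all and replace_first are names chosen for the example, not part of the library.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>; with TR1 the names would be std::tr1::...

// Global replace, like Perl's s///g. The input string is not modified.
std::string replace_all(const std::string& text,
                        const std::string& pattern,
                        const std::string& replacement)
{
    std::regex rx(pattern);
    return std::regex_replace(text, rx, replacement);
}

// Replace only the first match, like Perl's plain s///.
std::string replace_first(const std::string& text,
                          const std::string& pattern,
                          const std::string& replacement)
{
    std::regex rx(pattern);
    return std::regex_replace(text, rx, replacement,
                              std::regex_constants::format_first_only);
}
```

Given "Hello world world", replace_all returns "Hello planet planet" and replace_first returns "Hello planet world".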

Escape sequences

Regular expression processing is not as convenient in C++ as it is in languages such as Perl that have built-in regular expression support. One reason is escape sequences. To send a backslash \ to the regular expression engine, you have to type \\ in the source code. For example, consider these definitions.

        
    std::string str = "Hello\tworld";
    std::tr1::regex rx("o\\tw");
        

The string str contains a tab character between the o and the w. The regular expression rx does not contain a tab character; it contains \t, the regular expression syntax for matching a tab character.
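If a C++11 compiler is available, raw string literals are one way to avoid the doubled backslashes. They are a C++11 language feature rather than part of TR1, so treat the sketch below as an aside under that assumption. It also uses std in place of std::tr1, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 compiler and <regex>.

// Pattern written with a doubled backslash: the compiler turns "\\t" into
// the two characters \ and t, which the regex engine reads as "match a tab".
bool matches_escaped(const std::string& text)
{
    std::regex rx("o\\tw");
    return std::regex_search(text, rx);
}

// The same pattern as a raw string literal: nothing between R"( and )"
// is processed by the compiler, so the backslash is typed only once.
bool matches_raw(const std::string& text)
{
    std::regex rx(R"(o\tw)");
    return std::regex_search(text, rx);
}
```

Both functions return true for "Hello\tworld", which contains a real tab, and false for "Hello world".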

Case-sensitivity

C++ regular expressions are case-sensitive by default, as in Perl and many other environments. To specify that a regular expression is case-insensitive, add the flag std::tr1::regex_constants::icase as a second argument to the regex constructor. (The constructor flags can be combined with a bit-wise OR. So if you’re specifying a flag for the regular expression flavor, you can follow it with | icase to combine the two.)
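A short sketch of the flag in use, combining a flavor flag with icase. It assumes a C++11 compiler (std in place of std::tr1), and the two function names are invented for illustration.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>; with TR1 the names would be std::tr1::...

// Default construction: the search is case-sensitive.
bool contains_case_sensitive(const std::string& text, const std::string& pattern)
{
    std::regex rx(pattern);
    return std::regex_search(text, rx);
}

// Flavor flag and icase combined with a bit-wise OR.
bool contains_ignoring_case(const std::string& text, const std::string& pattern)
{
    std::regex rx(pattern,
                  std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_search(text, rx);
}
```

Searching "HELLO world" for the pattern "hello" fails with the first function and succeeds with the second.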

Support for case-sensitivity highlights the differences between C++ and scripting languages. C++ allows more control over regular expressions but also requires more input. For example, Perl makes the m// (match) and s/// (replace) operators case-insensitive by simply appending an i. While the regular expression syntax in C++ is more cluttered than that of scripting languages, people who use C++ are doing so because they value control over succinct syntax.

Troubleshooting

If you have trouble linking with the regex library in Visual Studio 2008, this post may help.
