This article is written for the benefit of someone familiar with regular expressions but not with the use of regular expressions in C++ via the TR1 (C++ Standards Committee Technical Report 1) extensions. Comparisons will be made with Perl for those familiar with Perl, though no knowledge of Perl is required. The focus is not on the syntax of regular expressions per se but rather on how to use regular expressions to search for patterns and make replacements.
The C++ TR1 regular expression specification has an intimidating array of options. This article is intended to get you started, not to explore every nook and cranny. Getting started is the harder part since it’s easier to find API details than basic examples.
The examples below use fully qualified namespaces for clarity. You could make your code more succinct by adding a few using statements to eliminate namespace qualifiers.
The C++ TR1 regular expressions can follow the syntax of several regular expression environments depending on the optional flags sent to the regular expression class constructor. The six options given in the Microsoft implementation are ECMAScript, basic, extended, awk, grep, and egrep.
The default for the Microsoft implementation is ECMAScript.
The choice of flavors is extensible and implementation-specific. For example, the Boost implementation adds perl as an option, which presumably follows Perl 5 syntax more closely than the ECMAScript option does.
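As a rough sketch of how a flavor is selected, the functions below search for a run of digits using two different syntaxes. The sketch assumes a C++11 compiler, where the TR1 interface was adopted into namespace std; with a TR1 library the names would live in std::tr1 instead. The function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// ECMAScript flavor: \d is shorthand for a digit.
// The backslash is doubled because this is a C++ string literal.
bool has_digits_ecmascript(const std::string& text) {
    const std::regex rx("\\d+", std::regex_constants::ECMAScript);
    return std::regex_search(text, rx);
}

// POSIX extended flavor: digits are written as a bracketed character class.
bool has_digits_extended(const std::string& text) {
    const std::regex rx("[[:digit:]]+", std::regex_constants::extended);
    return std::regex_search(text, rx);
}
```

The flavor flag is the second constructor argument in both cases; only the pattern syntax changes.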
For someone familiar with regular expressions the difficulty in using regular expressions in C++ TR1 is not in the syntax of regular expressions themselves, but rather in using regular expressions to do work.
The C++ regular expression functions are defined in the <regex> header and contained in the namespace std::tr1. Note that tr is lowercase in C++. In English prose “TR” is capitalized.
The first surprise you may run into with the C++ regular expression implementation is that regex_match does not “match” in the usual sense. It will return true only when the entire string matches the regular expression. The function regex_search works more like the match operator in other environments, such as the m// operator in Perl.
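To illustrate regex_match and regex_search, start with a C++ string
std::string str = "Hello world";
and construct a regular expression rx that matches only part of that string, for example
std::tr1::regex rx("ello");
The expression
regex_match(str.begin(), str.end(), rx)
returns false because the string str contains more characters beyond the match of the regular expression rx. However, the expression
regex_search(str.begin(), str.end(), rx)
returns true because the regular expression matches a substring of str.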
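The difference between the two functions can be sketched as a pair of small helpers. This is a minimal illustration, not a recommended wrapper: it assumes a C++11 compiler where the TR1 names are available in namespace std (with a TR1 library, substitute std::tr1), and the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// True only if the whole of text matches the pattern.
bool matches_entirely(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    return std::regex_match(text.begin(), text.end(), rx);
}

// True if some substring of text matches the pattern, like Perl's m//.
bool matches_somewhere(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    return std::regex_search(text.begin(), text.end(), rx);
}
```

With these helpers, matches_entirely("Hello world", "ello") is false while matches_somewhere("Hello world", "ello") is true.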
After performing a match in Perl, the captured matches are stored in the variables $1, $2, etc. Similarly, after a match C++ places the captured matches in a match_results object. However, while Perl always creates the $n variables, C++ does not store matches unless you call an overloaded form of regex_search that takes a match_results object. The class match_results is a template; often people use the class cmatch defined by
typedef match_results<const char*> cmatch;
The following example shows how to retrieve captured matches.
std::tr1::cmatch res;
std::string str = "<h2>Egg prices</h2>";
std::tr1::regex rx("<h(.)>([^<]+)");
std::tr1::regex_search(str.c_str(), res, rx);
std::cout << res[1] << ". " << res[2] << "\n";
The code above will output
2. Egg prices
Here res[1] corresponds to Perl’s $1, res[2] corresponds to $2, and so on.
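In real code you would usually check whether the search succeeded before reading the captures. The sketch below wraps the heading example in a hypothetical function, parse_heading, that returns the two captures as a pair and returns two empty strings when nothing matches. It assumes a C++11 compiler with the TR1 names in namespace std; with a TR1 library, substitute std::tr1.

```cpp
#include <regex>
#include <string>
#include <utility>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// Hypothetical helper: returns (heading level, heading text),
// or a pair of empty strings if the pattern is not found.
std::pair<std::string, std::string> parse_heading(const std::string& html) {
    const std::regex rx("<h(.)>([^<]+)");
    std::cmatch res;  // match_results<const char*>

    // The captures refer to html's character buffer, so copy them
    // into strings before returning.
    if (!std::regex_search(html.c_str(), res, rx)) {
        return std::make_pair(std::string(), std::string());
    }
    return std::make_pair(res[1].str(), res[2].str());
}
```

For the input "<h2>Egg prices</h2>" the pair holds "2" and "Egg prices".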
The following code will replace “world” in the string “Hello world” with “planet”. The string str2 will contain “Hello planet” and the string str will remain unchanged.
std::string str = "Hello world";
std::tr1::regex rx("world");
std::string replacement = "planet";
std::string str2 = std::tr1::regex_replace(str, rx, replacement);
Note that regex_replace does not change its arguments, unlike Perl’s s/// operator, which modifies the string it is applied to.
Note also that in the TR1 interface the third argument to regex_replace must be a string object and not a string literal. You could, however, eliminate the temporary variable replacement by constructing the string inside the call to regex_replace:
regex_replace(str, rx, std::string("planet"))
By default, all instances of the pattern that match the regular expression are replaced. In the example above, if str had been "Hello world world" the result would have been "Hello planet planet". To replace only the first instance (to produce "Hello planet world") you would need to add the flag std::tr1::regex_constants::format_first_only as the fourth argument to regex_replace.
Because the default behavior of regex_replace is a global replace, the function is analogous to the s///g operator in Perl. With the format_first_only flag the function is analogous to the unmodified s/// Perl operator.
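The two replacement modes can be sketched side by side. The wrapper names replace_all and replace_first are invented for this illustration, and the code assumes a C++11 compiler with the TR1 names in namespace std; with a TR1 library, substitute std::tr1.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// Global replace, analogous to Perl's s///g. The input is not modified.
std::string replace_all(const std::string& text,
                        const std::regex& rx,
                        const std::string& replacement) {
    return std::regex_replace(text, rx, replacement);
}

// Replace only the first match, analogous to Perl's plain s///.
std::string replace_first(const std::string& text,
                          const std::regex& rx,
                          const std::string& replacement) {
    return std::regex_replace(text, rx, replacement,
                              std::regex_constants::format_first_only);
}
```

Applied to "Hello world world" with the pattern world and the replacement planet, replace_all gives "Hello planet planet" and replace_first gives "Hello planet world".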
Regular expression processing is not as convenient in C++ as it is in languages such as Perl that have built-in regular expression support. One reason is escape sequences. To send a backslash \ to the regular expression engine, you have to type \\ in the source code. For example, consider these definitions.
std::string str = "Hello\tworld";
std::tr1::regex rx("o\\tw");
The string str contains a tab character between the o and the w. The regular expression rx does not contain a tab character; it contains \t, the regular expression syntax for matching a tab character.
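The doubling gets worse when the pattern itself must match a literal backslash, since the regular expression engine also treats the backslash as an escape character. The following sketch shows both cases. The function names are hypothetical, and the code assumes a C++11 compiler with the TR1 names in namespace std; with a TR1 library, substitute std::tr1.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// The literal "o\\tw" reaches the regex engine as the four characters
// o, backslash, t, w, which the engine reads as: o, tab, w.
bool has_tab_between_o_and_w(const std::string& text) {
    const std::regex rx("o\\tw");
    return std::regex_search(text, rx);
}

// Matching one literal backslash takes four backslashes in source:
// the compiler halves them to two, and the regex engine halves them again.
bool contains_backslash(const std::string& text) {
    const std::regex rx("\\\\");
    return std::regex_search(text, rx);
}
```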
C++ regular expressions are case-sensitive by default, as in Perl and many other environments. To specify that a regular expression is case-insensitive, add the flag std::tr1::regex_constants::icase as a second argument to the regex constructor. (The constructor flags can be combined with a bitwise or. So if you’re specifying a flag for the regular expression flavor, you can follow it with | icase to combine the two.)
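Here is a minimal sketch of combining a flavor flag with icase. The helper names are made up for illustration, and the code assumes a C++11 compiler with the TR1 names in namespace std; with a TR1 library, substitute std::tr1.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>. With a TR1 library, replace std:: with std::tr1::.

// Default construction: the search is case-sensitive.
bool contains_case_sensitive(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    return std::regex_search(text, rx);
}

// Flavor flag combined with icase via bitwise or: the search ignores case.
bool contains_ignoring_case(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern,
                        std::regex_constants::ECMAScript |
                        std::regex_constants::icase);
    return std::regex_search(text, rx);
}
```

Searching "Hello World" for world fails with the first helper and succeeds with the second.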
Support for case-sensitivity highlights the differences between C++ and scripting languages. C++ allows more control over regular expressions but also requires more input. For example, Perl makes the m// (match) and s/// (replace) operators case-insensitive by simply appending an i. While the regular expression syntax in C++ is more cluttered than that of scripting languages, people who use C++ are doing so because they value control over succinct syntax.
If you have trouble linking with the regex library in Visual Studio 2008, this post may help.