Overview
Header and namespace
C++ regular expression flavor
Matching
Retrieving matches
Replacing matches
Escape sequences
Case-sensitivity
Troubleshooting
Overview
This article is written for the benefit of someone familiar with regular expressions but not with the use of regular expressions in C++ via the TR1 (C++ Standards Committee Technical Report 1) extensions. Comparisons will be made with Perl for those familiar with Perl, though no knowledge of Perl is required. The focus is not on the syntax of regular expressions per se but rather how to use regular expressions to search for patterns and make replacements.
Support for TR1 extensions was added to Visual Studio 2008 as a feature pack, and it is included in Visual Studio 2010. Other implementations include Boost and Dinkumware.
The C++ TR1 regular expression specification has an intimidating array of options. This article is intended to get you started, not to explore every nook and cranny. Getting started is the harder part since it’s easier to find API details than basic examples.
The examples below use fully qualified namespaces for clarity. You could make your code more succinct by adding a few using statements to eliminate namespace qualifiers.
C++ TR1 regular expression flavor
The C++ TR1 regular expressions can follow the syntax of several regular expression environments depending on the optional flags sent to the regular expression class constructor. The six options given in the Microsoft implementation are as follows.
basic
extended
ECMAScript
awk
grep
egrep
The default for the Microsoft implementation is ECMAScript, matching the regular expression syntax of the ECMAScript (JavaScript) language, which is very similar to that in Perl 5.
The choice of flavors is extensible and implementation-specific. For example, the Boost implementation adds perl as an option, which presumably follows Perl 5 syntax more closely than the ECMAScript option does.
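As a rough illustration of how the flavor flag changes what a pattern means, here is a minimal sketch. It assumes a C++11 compiler, where the TR1 regex facilities live in namespace std rather than std::tr1; the helper name whole_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 <regex>, so names are in std:: rather than std::tr1::.
// Hypothetical helper: true when all of `text` matches `pattern`
// interpreted under the given syntax flavor.
bool whole_match(const std::string& text, const std::string& pattern,
                 std::regex_constants::syntax_option_type flavor)
{
    std::regex rx(pattern, flavor);   // the flavor is the constructor's second argument
    return std::regex_match(text, rx);
}
```

Under the ECMAScript flavor, whole_match("aaa", "a+", std::regex_constants::ECMAScript) should be true because + means "one or more". In POSIX basic syntax + is an ordinary character, so with std::regex_constants::basic the same pattern should match only the literal text "a+".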
For someone familiar with regular expressions the difficulty in using regular expressions in C++ TR1 is not in the syntax of regular expressions themselves, but rather in using regular expressions to do work.
Header and namespace
The C++ regular expression functions are defined in the <regex> header and contained in the namespace std::tr1. Note that tr is lowercase in C++. In English prose “TR” is capitalized.
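A small sketch of using-declarations that remove the namespace qualifiers follows. It assumes a C++11 compiler, where the names are in std; with a TR1 implementation the declarations would name std::tr1 instead. The helper has_digit is hypothetical.

```cpp
#include <regex>    // assumption: C++11 header; TR1 builds would use std::tr1 names
#include <string>

// Using-declarations let the rest of the file drop the namespace qualifier.
using std::regex;
using std::regex_search;

// Hypothetical helper: true when `text` contains at least one digit.
bool has_digit(const std::string& text)
{
    static const regex rx("[0-9]");
    return regex_search(text, rx);
}
```

Naming individual symbols this way keeps the rest of the namespace out of scope, which is usually a safer choice than a blanket using-directive.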
Matching
The first surprise you may run into with the C++ regular expression implementation is that regex_match does not “match” in the usual sense. It will return true only when the entire string matches the regular expression. The function regex_search works more like the match operator in other environments, such as the m// operator in Perl.
To illustrate regex_match and regex_search, start with a C++ string
std::string str = "Hello world";
and construct a regular expression
std::tr1::regex rx("ello");
The expression
regex_match(str.begin(), str.end(), rx)
will return false because the string str contains more characters beyond the match of the regular expression rx. However,
regex_search(str.begin(), str.end(), rx)
will return true because the regular expression matches a substring of str.
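The difference between the two functions can be sketched as a pair of helpers. This assumes C++11, with std::regex standing in for std::tr1::regex; the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 std::regex stands in for std::tr1::regex.

// True only when the entire string matches the pattern.
bool matches_whole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}

// True when some substring of the string matches the pattern.
bool matches_anywhere(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex(pattern));
}
```

With the string "Hello world" and the pattern "ello", matches_whole returns false and matches_anywhere returns true. Widening the pattern to ".*ello.*" makes matches_whole succeed, because the pattern then covers the whole string.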
Retrieving matches
After performing a match in Perl, the captured matches are stored in the variables $1, $2, etc. Similarly, after a search C++ places matches in a match_results object. However, while Perl always creates the $n variables, C++ does not store matches unless you call an overloaded form of regex_search that takes a match_results object. The class match_results is a template; often people use the class cmatch defined by
typedef match_results<const char*> cmatch;
The following example shows how to retrieve captured matches.
std::tr1::cmatch res;
str = "<h2>Egg prices</h2>";
std::tr1::regex rx("<h(.)>([^<]+)");
std::tr1::regex_search(str.c_str(), res, rx);
std::cout << res[1] << ". " << res[2] << "\n";
The code above will output
2. Egg prices
Note that res[n] corresponds to Perl’s $n.
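When the input is a std::string rather than a C string, the smatch typedef (match_results over string iterators) avoids the call to c_str(). The sketch below assumes C++11 names in std; the helper parse_heading is hypothetical. It also checks the return value of regex_search before reading any captures.

```cpp
#include <regex>
#include <string>
#include <utility>

// Assumption: C++11 std::regex / std::smatch stand in for the std::tr1 versions.
// Hypothetical helper: returns (heading level, heading text) for input such as
// "<h2>Egg prices</h2>", or a pair of empty strings when nothing matches.
std::pair<std::string, std::string> parse_heading(const std::string& html)
{
    static const std::regex rx("<h(.)>([^<]+)");
    std::smatch res;
    if (!std::regex_search(html, res, rx))
        return std::make_pair(std::string(), std::string());
    // res[0] is the whole match; res[1] and res[2] are the captured groups.
    return std::make_pair(res[1].str(), res[2].str());
}
```

For the input "<h2>Egg prices</h2>" the first element of the result is "2" and the second is "Egg prices".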
Replacing matches
The following code will replace “world” in the string “Hello world” with “planet”. The string str2 will contain “Hello planet” and the string str will remain unchanged.
std::string str = "Hello world";
std::tr1::regex rx("world");
std::string replacement = "planet";
std::string str2 = std::tr1::regex_replace(str, rx, replacement);
Note that regex_replace does not change its arguments, unlike the Perl command s/world/planet/.
Note also that the third argument to regex_replace must be a string object and not a string literal. You could, however, eliminate the temporary variable replacement by passing a string literal explicitly converted to a string in the call to regex_replace.
regex_replace(str, rx, std::string("planet"))
By default, all substrings that match the regular expression are replaced. In the example above, if str had been "Hello world world" the result would have been "Hello planet planet". To replace only the first instance (to produce "Hello planet world") you would need to add the flag std::tr1::regex_constants::format_first_only as the fourth argument to regex_replace.
Because the default behavior of regex_replace is a global replace, the function is analogous to the s///g operator in Perl. With the format_first_only flag the function is analogous to the unmodified s/// Perl operator.
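The two replacement modes can be sketched side by side. This assumes C++11 names in std rather than std::tr1; the helper names replace_all and replace_first are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 std::regex_replace stands in for the std::tr1 version.

// Global replace, roughly Perl's s///g. The input string is not modified.
std::string replace_all(const std::string& text, const std::string& pattern,
                        const std::string& replacement)
{
    return std::regex_replace(text, std::regex(pattern), replacement);
}

// Replace only the first match, roughly Perl's s/// without the g modifier.
std::string replace_first(const std::string& text, const std::string& pattern,
                          const std::string& replacement)
{
    return std::regex_replace(text, std::regex(pattern), replacement,
                              std::regex_constants::format_first_only);
}
```

Given "Hello world world", replace_all with the pattern "world" and the replacement "planet" yields "Hello planet planet", while replace_first yields "Hello planet world".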
Escape sequences
Regular expression processing is not as convenient in C++ as it is in languages such as Perl that have built-in regular expression support. One reason is escape sequences. To send a backslash \ to the regular expression engine, you have to type \\ in the source code. For example, consider these definitions.
std::string str = "Hello\tworld";
std::tr1::regex rx("o\\tw");
The string str contains a tab character between the o and the w. The regular expression rx does not contain a tab character; it contains \t, the regular expression syntax for matching a tab character.
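Compilers that support C++11 offer raw string literals, which are not part of TR1 and are mentioned here only as an assumption about a newer toolchain. Inside R"(...)" backslashes are not processed by the compiler, so the pattern can be written as the regex engine sees it. The sketch below builds the same pattern both ways; the helper name is hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 compiler (std::regex and raw string literals).
// Hypothetical helper: true when `text` has a tab between an 'o' and a 'w'.
// Both patterns reach the regex engine as the four characters o \ t w.
bool tab_between_o_and_w(const std::string& text)
{
    static const std::regex escaped("o\\tw");    // backslash doubled for the compiler
    static const std::regex raw(R"(o\tw)");      // raw literal: no doubling needed
    return std::regex_search(text, escaped) && std::regex_search(text, raw);
}
```

Calling the helper on "Hello\tworld" returns true, and on "Hello world" (with a space) it returns false.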
Case-sensitivity
C++ regular expressions are case-sensitive by default, as in Perl and many other environments. To specify that a regular expression is case-insensitive, add the flag std::tr1::regex_constants::icase as a second argument to the regex constructor. (The constructor flags can be combined with a bitwise or. So if you’re specifying a flag for the regular expression flavor, you can follow it with | icase to combine the two.)
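A minimal sketch of combining a flavor flag with icase follows, assuming C++11 names in std rather than std::tr1. The helper name contains_ignoring_case is hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: C++11 std::regex stands in for std::tr1::regex.
// Hypothetical helper: case-insensitive search using the ECMAScript flavor.
bool contains_ignoring_case(const std::string& text, const std::string& pattern)
{
    // The flavor flag and icase are combined with a bitwise or.
    std::regex rx(pattern,
                  std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_search(text, rx);
}
```

With this helper, searching "Hello World" for "world" succeeds, whereas a regex constructed without icase would not find a match because of the capital W.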
Support for case-sensitivity highlights the differences between C++ and scripting languages. C++ allows more control over regular expressions but also requires more input. For example, Perl makes the m// (match) and s/// (replace) operators case-insensitive by simply appending an i. While the regular expression syntax in C++ is more cluttered than that of scripting languages, people who use C++ are doing so because they value control over succinct syntax.
Troubleshooting
If you have trouble linking with the regex library in Visual Studio 2008, this post may help.