The R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns.
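As a minimal sketch of that row-selection use case, the snippet below filters a small data frame by a pattern match in one column, using the grep function described later in this article. The data frame fruit and its columns are hypothetical, made up only for illustration.

```r
# Hypothetical data frame, invented for this example.
fruit <- data.frame(
  name  = c("crab apple", "banana", "apple sauce"),
  price = c(1.50, 0.25, 3.00),
  stringsAsFactors = FALSE
)

# grep returns the indices of the rows whose name matches the pattern;
# those indices can be used directly to subset the data frame.
apple_rows <- fruit[grep("apple", fruit$name), ]
```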
R supports two regular expression flavors: POSIX 1003.2 and Perl. By default R uses POSIX extended regular expressions. Regular expression functions in R take an argument perl, which defaults to FALSE; if perl is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library. Older versions of R also had an argument extended, which defaulted to TRUE and could be set to FALSE to select basic POSIX regular expressions, but current versions of R no longer support that argument.
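One way to see the difference between the flavors is to use a feature that belongs to the Perl flavor, such as a lookahead assertion. The sketch below is illustrative only; the input vector is made up.

```r
x <- c("apple sauce", "apple jack")

# (?= sauce) is a Perl-style lookahead: match "apple" only when it is
# followed by " sauce". Lookaheads need perl = TRUE.
idx <- grep("apple(?= sauce)", x, perl = TRUE)
```

Here idx contains only the index of the first element, since only "apple sauce" satisfies the lookahead.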
Regular expressions are represented as strings. Because the backslash is also the escape character in R string literals, backslashes in a regular expression must be doubled. For example, the regular expression token \w must be entered as \\w to prevent R from interpreting the leading backslash before sending the string to the regular expression parser.
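A short sketch of the escaping rule, with a made-up input vector: the R source text "\\w" denotes a two-character string, a backslash followed by w, which is what the regular expression parser receives.

```r
x <- c("abc", "!!!", "a_1")

# "\\w" in R source is the regex token \w (a word character).
word_idx <- grep("\\w", x)

# nchar shows the string really holds two characters: a backslash and a w.
n <- nchar("\\w")
```

The first and third elements contain word characters, so word_idx holds 1 and 3, and n is 2.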
The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of indices. If the regular expression matches a particular vector component, that component’s index is part of the returned vector.
Example:
grep("apple", c("crab apple", "Apple jack", "apple sauce"))
returns the vector (1, 3) because the first and third elements of the input vector contain “apple.” Note that grep is case-sensitive by default, so “apple” does not match “Apple.” To perform a case-insensitive match, add ignore.case = TRUE to the function call.
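The same example with and without ignore.case, as a quick sketch (the variable names are chosen here for illustration):

```r
x <- c("crab apple", "Apple jack", "apple sauce")

# Default: case-sensitive, so "Apple jack" is skipped.
case_sensitive <- grep("apple", x)

# With ignore.case = TRUE all three elements match.
case_insensitive <- grep("apple", x, ignore.case = TRUE)
```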
There is an optional argument value that defaults to FALSE. If this argument is set to TRUE, grep will return the matching elements themselves rather than their indices.
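Continuing the earlier example, a sketch of value = TRUE. Note that the result contains the whole matching strings, not just the matched substrings.

```r
x <- c("crab apple", "Apple jack", "apple sauce")

# Return the matching elements instead of their positions.
matched <- grep("apple", x, value = TRUE)
# matched is c("crab apple", "apple sauce")
```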
The function sub replaces one pattern with another. It requires three arguments: a regular expression, a replacement pattern, and a vector of strings to process. It is analogous to s/// in Perl. Note that capture references such as \1 or \2 in the replacement pattern must be entered as \\1 or \\2, whether or not you select the Perl flavor by adding perl = TRUE.
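As an illustration of capture references in the replacement, the sketch below swaps "Last, First" into "First Last". The names in the input vector are invented for the example.

```r
x <- c("Smith, John", "Doe, Jane")

# \\1 and \\2 refer to the first and second parenthesized groups.
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", x)

# The same call using the Perl flavor gives the same result here.
swapped_perl <- sub("(\\w+), (\\w+)", "\\2 \\1", x, perl = TRUE)
```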
The sub function replaces only the first instance of a regular expression. To replace all instances of a pattern, use gsub. The gsub function is analogous to s///g in Perl.
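A small side-by-side sketch of the difference, using a made-up input string:

```r
x <- "one fish two fish"

# sub stops after the first match.
first_only <- sub("fish", "cat", x)    # "one cat two fish"

# gsub replaces every match.
every_match <- gsub("fish", "cat", x)  # "one cat two cat"
```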
The function regexpr requires two arguments, a regular expression and a vector of text to process. It is similar to grep, but returns the position of the first match within each component, along with the length of each match in a match.length attribute. If a particular component does not match the regular expression, the return vector contains a -1 for that component. The function gregexpr is a variation on regexpr that returns the positions of all matches in each component, as a list with one entry per component.
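The sketch below shows what the two functions return for a small made-up vector. The last step, counting matches per component, is one possible way to derive counts from the gregexpr result; it is not something gregexpr returns directly.

```r
x <- c("banana", "cherry", "apple")

# Position of the first match in each element, -1 where there is none.
first_pos <- as.integer(regexpr("an", x))   # 2 -1 -1

# gregexpr returns a list; each entry holds the positions of all matches.
all_pos <- gregexpr("an", x)
banana_pos <- as.integer(all_pos[[1]])      # 2 4

# Count matches per element by counting the positive positions.
counts <- sapply(all_pos, function(m) sum(m > 0))
```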
The function strsplit also uses regular expressions, splitting its input according to a specified regular expression.
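For instance, the following sketch splits on a comma with optional surrounding spaces. The input is made up; strsplit returns a list with one character vector per input element.

```r
x <- c("a, b,c", "d ,e")

# Split on a comma, absorbing any spaces on either side of it.
parts <- strsplit(x, " *, *")
# parts[[1]] is c("a", "b", "c"); parts[[2]] is c("d", "e")
```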