Regular expressions in R
Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns.
R supports two regular expression flavors: POSIX 1003.2 and Perl.
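Regular expression functions in R take two arguments that select the flavor: extended, which defaults to TRUE, and perl, which defaults to FALSE. By default R uses POSIX extended regular expressions, though if extended is set to FALSE, it will use basic POSIX regular expressions. If perl is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library. (Recent versions of R have removed the extended argument and use the extended syntax by default.)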
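As a small sketch of the difference between the flavors, the following compares a pattern that relies on Perl-only syntax (a lookahead) with one that works in the default flavor. The vector x and the result names are made up for illustration.

```r
# Two candidate strings; only the first has an "a" followed by "b".
x <- c("ab", "ac")

# Lookahead (?=...) is Perl syntax, so perl = TRUE is requested here.
lookahead_hits <- grep("a(?=b)", x, perl = TRUE)

# Alternation works in the default (POSIX extended) flavor.
plain_hits <- grep("a(b|c)", x)

print(lookahead_hits)  # 1
print(plain_hits)      # 1 2
```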
Regular expressions are represented as strings. Metacharacters often
need to be escaped. For example, the metacharacter
\w must be entered as
\\w to prevent R from interpreting the leading backslash before sending
the string to the regular expression parser.
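A minimal illustration of the escaping rule, using made-up sample data: the R string literal "\\w" holds only two characters, a backslash and a w, which is what the regular expression parser needs to see.

```r
# "\\w" in R source is a two-character string: a backslash followed by w.
n_chars <- nchar("\\w")

# Illustrative input: the middle element has no word characters.
words <- c("alpha", "!!!", "x1")

# Indices of the elements that contain at least one word character.
word_hits <- grep("\\w+", words)

print(n_chars)    # 2
print(word_hits)  # 1 3
```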
The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of indices: if the regular expression matches a particular vector component, that component's index is part of the result. For example,
grep("apple", c("crab apple", "Apple jack", "apple sauce"))
returns the vector (1, 3) because the first and third elements of the input vector contain "apple." Note that grep is case-sensitive by default and so "apple" does not match "Apple." To perform a case-insensitive match, add ignore.case = TRUE to the function call.
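The same call with and without ignore.case might look like this; the variable names are only for illustration.

```r
fruit <- c("crab apple", "Apple jack", "apple sauce")

# Default behavior: case-sensitive, so "Apple jack" is skipped.
case_sensitive <- grep("apple", fruit)

# With ignore.case = TRUE all three elements match.
case_insensitive <- grep("apple", fruit, ignore.case = TRUE)

print(case_sensitive)    # 1 3
print(case_insensitive)  # 1 2 3
```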
There is an optional argument value that defaults to FALSE. If this argument is set to TRUE, grep will return the matching elements themselves rather than their indices.
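A short sketch of the value argument, reusing the apple example with an illustrative variable name:

```r
fruit <- c("crab apple", "Apple jack", "apple sauce")

# value = TRUE returns the matching strings instead of their positions.
matched <- grep("apple", fruit, value = TRUE)

print(matched)  # "crab apple"  "apple sauce"
```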
The sub function replaces one pattern with another. It requires three arguments: a regular expression, a replacement pattern, and a vector of strings to process. It is analogous to s/// in Perl. Note that if you use the Perl regular expression flavor by adding perl = TRUE and want to use capture references such as \2 in the replacement pattern, these must be entered as \\2. (The same doubling is needed with the default flavor.)
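As a hedged example of capture references in a replacement, the following swaps "Last, First" into "First Last". The input vector is invented for illustration, and both the pattern and the replacement double their backslashes.

```r
# Illustrative input in "Last, First" form.
names_in <- c("Doe, John", "Smith, Jane")

# Two capture groups; "\\2 \\1" in the replacement reverses their order.
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", names_in, perl = TRUE)

print(swapped)  # "John Doe"   "Jane Smith"
```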
The sub function replaces only the first instance of a regular expression in each string. To replace all instances of a pattern, use gsub. The gsub function is analogous to s///g in Perl.
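The difference between replacing the first match and replacing every match can be seen on a small made-up string:

```r
x <- "a-b-c"

# sub stops after the first match.
first_only <- sub("-", "+", x)

# gsub replaces every match.
all_matches <- gsub("-", "+", x)

print(first_only)   # "a+b-c"
print(all_matches)  # "a+b+c"
```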
The function regexpr requires two arguments, a regular expression and a vector of text to process. It is similar to grep, but returns the locations of the regular expression matches: for each component, the starting position of the first match. If a particular component does not match the regular expression, the return vector contains a -1 for that component. The function gregexpr is a variation on regexpr that returns the positions of all matches in each component, from which the number of matches in each component can be counted.
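A brief sketch of both functions on illustrative data. The counting step at the end is one possible way to turn gregexpr output into the number of matches per component. It relies on gregexpr reporting -1 when there is no match.

```r
fruit <- c("crab apple", "Apple jack", "apple sauce, apple pie")

# Position of the first match in each element, or -1 if there is none.
first_pos <- as.integer(regexpr("apple", fruit))

# gregexpr returns a list with one vector of match positions per element.
all_pos <- gregexpr("apple", fruit)
third_pos <- as.integer(all_pos[[3]])

# Count matches per element; a lone -1 means no match and counts as zero.
counts <- vapply(all_pos, function(m) sum(m > 0), integer(1))

print(first_pos)  # 6 -1 1
print(third_pos)  # 1 14
print(counts)     # 1 0 2
```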
The function strsplit also uses regular expressions, splitting its input according to a specified regular expression.
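For instance, a string with inconsistent delimiters might be split as follows. The input and the delimiter pattern are assumptions chosen for illustration. Note that strsplit returns a list, with one element per input string.

```r
# Fields separated by commas or semicolons with stray spaces.
line <- "a, b;c ,d"

# Split on a comma or semicolon together with any surrounding spaces.
parts <- strsplit(line, " *[,;] *")[[1]]

print(parts)  # "a" "b" "c" "d"
```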
Notes on using regular expressions in other languages: