Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files. However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns.
R supports two regular expression flavors: POSIX 1003.2 and Perl. Regular expression functions in R take two arguments that select the flavor: extended, which defaults to TRUE, and perl, which defaults to FALSE. By default R uses POSIX extended regular expressions, though if extended is set to FALSE, it will use basic POSIX regular expressions. If perl is set to TRUE, R will use the Perl 5 flavor of regular expressions as implemented in the PCRE library. Recent versions of R no longer accept the extended argument; they use POSIX extended regular expressions unless perl = TRUE is given.
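As a rough illustration of the flavor switch, the sketch below matches one pattern with the default flavor and another that relies on a Perl-only feature (a lookahead). It assumes an R build with PCRE support, which is the usual case; the variable names are only for illustration.

```r
# Default flavor: POSIX extended regular expressions.
default_match <- grepl("a+b", c("aab", "b"))

# Lookahead (?=...) is a Perl feature, so perl = TRUE is needed here.
lookahead_match <- grepl("a(?=b)", c("ab", "ac"), perl = TRUE)

print(default_match)
print(lookahead_match)
```

The lookahead pattern is used only because it is a construct the POSIX flavors do not offer; any other Perl-specific syntax would serve equally well.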
Regular expressions are represented as strings, so backslashes in a pattern usually need to be escaped. For example, the metacharacter \w must be entered as \\w to prevent R from interpreting the leading backslash before sending the string to the regular expression parser.
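A small sketch of the escaping rule: the R string literal below contains two backslashes in the source, but the string handed to the regular expression parser contains only one. The names pattern and has_word_char are illustrative.

```r
# In R source, "\\w+" denotes the three-character regular expression \w+
pattern <- "\\w+"
pattern_length <- nchar(pattern)

# \w matches a word character, so only the first element matches.
has_word_char <- grepl("\\w", c("abc", "!!!"))

print(pattern_length)
print(has_word_char)
```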
The grep function requires two arguments. The first is a string containing a regular expression. The second is a vector of strings to search for matches. The grep function returns a vector of indices. If the regular expression matches a particular vector component, that component’s index is part of the result. For example,
grep("apple", c("crab apple", "Apple jack", "apple sauce"))
returns the vector (1, 3) because the first and third elements of the input vector contain “apple.” Note that grep is case-sensitive by default, so “apple” does not match “Apple.” To perform a case-insensitive match, add ignore.case = TRUE to the function call.
There is an optional argument value that defaults to FALSE. If this argument is set to TRUE, grep will return the matching elements themselves rather than their indices.
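The grep options described above can be sketched as follows. The data frame df and its columns are hypothetical, included only to show how the returned indices can select rows of a data set.

```r
fruits <- c("crab apple", "Apple jack", "apple sauce")

idx    <- grep("apple", fruits)                      # case-sensitive indices
idx_ci <- grep("apple", fruits, ignore.case = TRUE)  # case-insensitive indices
vals   <- grep("apple", fruits, value = TRUE)        # matching elements

# Hypothetical data set: keep the rows whose name column matches.
df <- data.frame(name = fruits, price = c(3, 5, 4), stringsAsFactors = FALSE)
apple_rows <- df[grep("apple", df$name), ]

print(idx)
print(idx_ci)
print(vals)
print(apple_rows)
```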
The sub function replaces one pattern with another. It requires three arguments: a regular expression, a replacement pattern, and a vector of strings to process. It is analogous to s/// in Perl. Note that capture references such as \2 in the replacement pattern must be entered as \\2, and this remains true when you select the Perl regular expression flavor by adding perl = TRUE. The sub function replaces only the first instance of a regular expression in each string. To replace all instances of a pattern, use gsub. The gsub function is analogous to s///g in Perl.
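A brief sketch contrasting the two functions, with a capture-reference example in the Perl flavor. The input strings are made up for illustration.

```r
x <- c("banana", "bandana")

first_only <- sub("an", "AN", x)    # replaces the first match in each string
all_hits   <- gsub("an", "AN", x)   # replaces every match in each string

# Capture references are written \\1, \\2 in the replacement string.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world", perl = TRUE)

print(first_only)
print(all_hits)
print(swapped)
```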
The regexpr function requires two arguments, a regular expression and a vector of text to process. It is similar to grep, but returns the location of the first match within each component rather than the indices of the matching components. If a particular component does not match the regular expression, the return vector contains a -1 for that component. The function gregexpr is a variation on regexpr that returns the locations of all matches in each component, as a list with one entry per component.
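The following sketch shows what the two functions return for a small made-up vector. Counting matches per component is not something either function does directly; the sapply line is one possible way to derive counts from the gregexpr result.

```r
x <- c("banana", "cherry", "bandana")

# Position of the first match in each component, -1 where there is none.
first_pos <- as.integer(regexpr("an", x))

# Positions of all matches, one list entry per component.
all_pos <- gregexpr("an", x)
banana_pos  <- as.integer(all_pos[[1]])
bandana_pos <- as.integer(all_pos[[3]])

# Derive the number of matches per component from the positions.
counts <- sapply(all_pos, function(m) sum(m > 0))

print(first_pos)
print(banana_pos)
print(bandana_pos)
print(counts)
```

The value returned by regexpr also carries a match.length attribute giving the length of each match.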
The strsplit function also uses regular expressions: it splits each input string wherever the pattern matches and returns a list with one vector of pieces per input string.
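As a minimal sketch, the call below splits on a comma or semicolon followed by optional spaces. The input strings and the choice of delimiter pattern are illustrative.

```r
parts <- strsplit(c("a, b,c", "x;y"), "[,;] *")

first_parts  <- parts[[1]]
second_parts <- parts[[2]]

print(first_parts)
print(second_parts)
```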
Notes on using regular expressions in other languages: