grep                  package:base                  R Documentation

_P_a_t_t_e_r_n _M_a_t_c_h_i_n_g _a_n_d _R_e_p_l_a_c_e_m_e_n_t

_D_e_s_c_r_i_p_t_i_o_n:

     'grep' searches for matches to 'pattern' (its first argument)
     within the character vector 'x' (second argument). 'regexpr' and
     'gregexpr' do too, but return more detail in a different format.

     'sub' and 'gsub' perform replacement of matches determined by
     regular expression matching.

_U_s_a_g_e:

     grep(pattern, x, ignore.case = FALSE, extended = TRUE,
          perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE)

     sub(pattern, replacement, x,
         ignore.case = FALSE, extended = TRUE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)

     gsub(pattern, replacement, x,
          ignore.case = FALSE, extended = TRUE, perl = FALSE,
          fixed = FALSE, useBytes = FALSE)

     regexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
             perl = FALSE, fixed = FALSE, useBytes = FALSE)

     gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
              perl = FALSE, fixed = FALSE, useBytes = FALSE)

_A_r_g_u_m_e_n_t_s:

 pattern: character string containing a regular expression (or
          character string for 'fixed = TRUE') to be matched in the
          given character vector.  Coerced by 'as.character' to a
          character string if possible.

 x, text: a character vector where matches are sought, or an object
          which can be coerced by 'as.character' to a character vector.

ignore.case: if 'FALSE', the pattern matching is _case sensitive_ and
          if 'TRUE', case is ignored during matching.

extended: if 'TRUE', extended regular expression matching is used, and
          if 'FALSE' basic regular expressions are used.

    perl: logical.  Should Perl-compatible regular expressions be
          used?  Takes priority over 'extended'.

   value: if 'FALSE', a vector containing the ('integer') indices of
          the matches determined by 'grep' is returned, and if 'TRUE',
          a vector containing the matching elements themselves is
          returned.

   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
          is.  Overrides all conflicting arguments.

useBytes: logical.  If 'TRUE' the matching is done byte-by-byte rather
          than character-by-character.  See 'Details'.

replacement: a replacement for matched pattern in 'sub' and 'gsub'. 
          Coerced to character if possible.  For 'fixed = FALSE' this
          can include backreferences '"\1"' to '"\9"' to parenthesized
          subexpressions of 'pattern'.  For 'perl = TRUE' only, it can
          also contain '"\U"' or '"\L"' to convert the rest of the
          replacement to upper or lower case. 
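
     As a minimal sketch of how several of these arguments interact,
     the following uses a small made-up vector ('fruits' and the
     result names are illustrative only, not part of the original
     page):

```r
# Hypothetical sample data, chosen only to illustrate the arguments.
fruits <- c("apple", "Banana", "a.b", "axb")

idx_default <- grep("an", fruits)                     # indices of matching elements
val_default <- grep("an", fruits, value = TRUE)       # the matching elements themselves
idx_nocase  <- grep("B", fruits, ignore.case = TRUE)  # 'b' and 'B' both match
idx_regex   <- grep("a.b", fruits)                    # '.' matches any single character
idx_fixed   <- grep("a.b", fruits, fixed = TRUE)      # '.' is matched literally

# Backreferences in 'replacement': swap the two parenthesized groups.
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")
```

     Here 'idx_regex' should pick up both "a.b" and "axb", whereas
     'idx_fixed' should pick up only "a.b".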

_D_e_t_a_i_l_s:

     Arguments which should be character strings or character vectors
     are coerced to character if possible.

     The two '*sub' functions differ only in that 'sub' replaces only
     the first occurrence of a 'pattern' whereas 'gsub' replaces all
     occurrences.
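
     A small illustration of that difference (the input string is an
     arbitrary example):

```r
s <- "banana"

first_only <- sub("a", "A", s)    # only the first 'a' is replaced
all_hits   <- gsub("a", "A", s)   # every 'a' is replaced

print(first_only)  # "bAnana"
print(all_hits)    # "bAnAnA"
```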

     For 'regexpr' it is an error for 'pattern' to be 'NA', otherwise
     'NA' is permitted and gives an 'NA' match.

     The regular expressions used are those specified by POSIX 1003.2,
     either extended or basic, depending on the value of the 'extended'
     argument, unless 'perl = TRUE' when they are those of PCRE, <URL:
     http://www.pcre.org/>. (The exact set of patterns supported may
     depend on the version of PCRE installed on the system in use if R
     was configured to use the system PCRE.)
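
     To give a feel for the two pattern dialects, the sketch below
     uses a POSIX character class with the default matching and a
     lookahead assertion, which is a PCRE feature, with 'perl = TRUE'.
     The input vectors are made up for illustration.

```r
# POSIX bracket expression: elements consisting only of digits.
codes <- c("123", "12a", "007")
digits_only <- grep("^[[:digit:]]+$", codes)

# PCRE lookahead: "foo" only when immediately followed by "bar".
# Lookahead is a PCRE construct, so 'perl = TRUE' is requested here.
files <- c("foobar", "foobaz")
foo_before_bar <- grep("foo(?=bar)", files, perl = TRUE)
```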

     'useBytes' is only used if 'fixed = TRUE' or 'perl = TRUE'. Its
     main effect is to avoid errors/warnings about invalid inputs and
     spurious matches, but for 'regexpr' it changes the interpretation
     of the output.
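
     The effect on 'regexpr' output can be sketched as follows.  This
     assumes the string is stored in UTF-8 (as "\u" escapes are), so
     that the accented letter occupies two bytes; the variable names
     are illustrative.

```r
s <- "\u00e9b"   # e-acute followed by 'b'; the accented letter is 2 bytes in UTF-8

# Position counted in characters (the default).
pos_chars <- as.integer(regexpr("b", s, fixed = TRUE))

# Position counted in bytes.
pos_bytes <- as.integer(regexpr("b", s, fixed = TRUE, useBytes = TRUE))
```

     Under these assumptions 'pos_chars' should be 2 and 'pos_bytes'
     should be 3.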

     PCRE only supports caseless matching for a non-ASCII pattern in a
     UTF-8 locale (and not for 'useBytes = TRUE' in any locale).

_V_a_l_u_e:

     For 'grep' a vector giving either the indices of the elements of
     'x' that yielded a match or, if 'value' is 'TRUE', the matched
     elements of 'x' (after coercion, preserving names but no other
     attributes).
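
     For instance, with a named character vector (the names and
     values here are invented for illustration):

```r
x <- c(first = "alpha", second = "beta", third = "gamma")

idx  <- grep("m", x)                 # index of the matching element
vals <- grep("m", x, value = TRUE)   # matching element, with its name kept

print(idx)
print(vals)
```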

     For 'sub' and 'gsub' a character vector of the same length and
     with the same attributes as 'x' (after possible coercion).
     Elements of character vectors 'x' which are not substituted will
     be returned unchanged (including any declared encoding).  If
     'useBytes = FALSE', either 'perl = TRUE' or 'fixed = TRUE' is
     set, and any element of 'pattern', 'replacement' and 'x' is
     declared to be in UTF-8, the result will be in UTF-8. Otherwise
     changed elements of the result will have their encoding declared
     as that of the current locale (see 'Encoding') if the
     corresponding input had a declared encoding and the current
     locale is either Latin-1 or UTF-8.

     For 'regexpr' an integer vector of the same length as 'text'
     giving the starting position of the first match, or -1 if there is
     none, with attribute '"match.length"' giving the length of the
     matched text (or -1 for no match).  In a multi-byte locale these
     quantities are in characters rather than bytes unless 'useBytes =
     TRUE' is used with 'fixed = TRUE' or 'perl = TRUE'.
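
     A short sketch of reading the 'regexpr' result and using it to
     extract the matched text with 'substring' (the input vector is
     an arbitrary example):

```r
x <- c("ten", "abc", "enliven")
m <- regexpr("en", x)

starts <- as.integer(m)             # start of first match, -1 if none
lens   <- attr(m, "match.length")   # length of that match, -1 if none

# Extract the matched text for the elements that did match.
hit     <- starts > 0
matched <- substring(x[hit], starts[hit], starts[hit] + lens[hit] - 1L)
```

     Here 'starts' is 2, -1, 1 and 'lens' is 2, -1, 2, so 'matched'
     holds "en" twice.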

     For 'gregexpr' a list of the same length as 'text' each element of
     which is an integer vector as in 'regexpr', except that the
     starting positions of every (disjoint) match are given.
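
     For example (inputs chosen for illustration), every match is
     reported per element, and overlapping candidates are not counted
     twice:

```r
g <- gregexpr("e", c("eleven", "abc"))
e_positions <- as.integer(g[[1]])   # all 'e' positions in "eleven"
no_match    <- as.integer(g[[2]])   # -1 signals no match in "abc"

# Matches are disjoint: "aa" is found at 1 and 3 in "aaaa", not at 2.
aa_positions <- as.integer(gregexpr("aa", "aaaa")[[1]])
```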

     If in a multi-byte locale the pattern or replacement is not a
     valid sequence of bytes, an error is thrown.  An invalid string in
     'x' or 'text' is a non-match with a warning for 'grep' or
     'regexpr', but an error for 'sub' or 'gsub'.

_W_a_r_n_i_n_g:

     The standard regular-expression code has been reported to be very
     slow when applied to extremely long character strings (tens of
     thousands of characters or more): the code used when 'perl = TRUE'
     seems much faster and more reliable for such usages.

     The standard version of 'gsub' does not correctly substitute
     repeated word boundaries (e.g. 'pattern = "\b"'). Use 'perl =
     TRUE' for such matches.
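
     A sketch of the recommended form, marking each word boundary in
     an arbitrary sentence with a vertical bar:

```r
sentence <- "The quick brown fox"

# Insert "|" at every word boundary, using the PCRE engine.
marked <- gsub("\\b", "|", sentence, perl = TRUE)
print(marked)
```

     With 'perl = TRUE' this should give one bar on each side of
     every word.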

     The 'perl = TRUE' option is only implemented for single-byte and
     UTF-8 encodings, and will warn if used in a non-UTF-8 multi-byte
     locale (unless 'useBytes = TRUE').

_R_e_f_e_r_e_n_c_e_s:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_. Wadsworth & Brooks/Cole ('grep')

_S_e_e _A_l_s_o:

     regular expression (aka 'regexp') for the details of the pattern
     specification.

     'glob2rx' to turn wildcard matches into regular expressions.

     'agrep' for approximate matching.

     'tolower', 'toupper' and 'chartr' for character translations.
     'charmatch', 'pmatch' and 'match' for other kinds of matching.
     'apropos' uses regexps and has nice examples.

_E_x_a_m_p_l_e_s:

     grep("[a-z]", letters)

     txt <- c("arm","foot","lefroo", "bafoobar")
     if(length(i <- grep("foo",txt)))
        cat("'foo' appears at least once in\n\t",txt,"\n")
     i # 2 and 4
     txt[i]

     ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
     gsub("([ab])", "\\1_\\1_", "abc and ABC")

     txt <- c("The", "licenses", "for", "most", "software", "are",
       "designed", "to", "take", "away", "your", "freedom",
       "to", "share", "and", "change", "it.",
        "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
        "is", "intended", "to", "guarantee", "your", "freedom", "to",
        "share", "and", "change", "free", "software", "--",
        "to", "make", "sure", "the", "software", "is",
        "free", "for", "all", "its", "users")
     ( i <- grep("[gu]", txt) ) # indices
     stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

     ## Note that in locales such as en_US this includes B as the
     ## collation order is aAbBcCdDeE ...
     (ot <- sub("[b-e]",".", txt))
     txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution

     txt[gsub("g","#", txt) !=
         gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

     regexpr("en", txt)

     gregexpr("e", txt)

     ## trim trailing white space
     str <- 'Now is the time      '
     sub(' +$', '', str)  ## spaces only
     sub('[[:space:]]+$', '', str) ## white space, POSIX-style
     sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

     ## capitalizing
     gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a test of capitalizing", perl=TRUE)
     gsub("\\b(\\w)", "\\U\\1", "a test of capitalizing", perl=TRUE)

