grep                  package:base                  R Documentation

_P_a_t_t_e_r_n _M_a_t_c_h_i_n_g _a_n_d _R_e_p_l_a_c_e_m_e_n_t

_D_e_s_c_r_i_p_t_i_o_n:

     'grep' searches for matches to 'pattern' (its first argument)
     within the character vector 'x' (its second argument).  'regexpr'
     does the same, but returns the position and length of the first
     match in each element rather than the indices of matching elements.

     'sub' and 'gsub' perform replacement of matches determined by
     regular expression matching.

_U_s_a_g_e:

     grep(pattern, x, ignore.case = FALSE, extended = TRUE, perl = FALSE,
          value = FALSE, fixed = FALSE, useBytes = FALSE)

     sub(pattern, replacement, x,
         ignore.case = FALSE, extended = TRUE, perl = FALSE, fixed = FALSE)

     gsub(pattern, replacement, x,
          ignore.case = FALSE, extended = TRUE, perl = FALSE, fixed = FALSE)

     regexpr(pattern, text, extended = TRUE, perl = FALSE, fixed = FALSE,
             useBytes = FALSE)

_A_r_g_u_m_e_n_t_s:

 pattern: character string containing a regular expression (or
          character string for 'fixed = TRUE') to be matched in the
          given character vector.  Coerced to character if possible.

 x, text: a character vector where matches are sought. Coerced to
          character if possible.

ignore.case: if 'FALSE', the pattern matching is _case sensitive_ and
          if 'TRUE', case is ignored during matching.

extended: if 'TRUE', extended regular expression matching is used, and
          if 'FALSE' basic regular expressions are used.

    perl: logical. Should perl-compatible regexps be used? Has priority
          over 'extended'.

   value: if 'FALSE', a vector containing the ('integer') indices of
          the matches determined by 'grep' is returned, and if 'TRUE',
          a vector containing the matching elements themselves is
          returned.

   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
          is.  Overrides all conflicting arguments.

useBytes: logical.  If 'TRUE' the matching is done byte-by-byte rather
          than character-by-character.  See Details.

replacement: a replacement for matched pattern in 'sub' and 'gsub'. 
          Coerced to character if possible.  This can include
          backreferences '"\1"' to '"\9"' to parenthesized
          subexpressions of 'pattern'.
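
     As a rough illustration of how some of these arguments interact,
     the sketch below runs 'grep' and 'sub' on a small made-up vector.
     The sample data and variable names are assumptions for
     illustration only and are not part of the original examples.

```r
# Hypothetical sample data, not taken from the help page.
x <- c("abc", "a.c", "ABC")

# Default: "." is a regex metacharacter that matches any single character.
idx_regex <- grep("a.c", x)                       # 1 2

# fixed = TRUE: the pattern is matched literally, so only "a.c" qualifies.
idx_fixed <- grep("a.c", x, fixed = TRUE)         # 2

# ignore.case = TRUE: "ABC" now matches as well.
idx_nocase <- grep("a.c", x, ignore.case = TRUE)  # 1 2 3

# Backreferences in 'replacement': swap two lower-case words.
# The backslash must be doubled inside an R string literal.
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")  # "world hello"
```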

_D_e_t_a_i_l_s:

     Arguments which should be character strings or character vectors
     are coerced to character if possible.

     The two '*sub' functions differ only in that 'sub' replaces only
     the first occurrence of a 'pattern' whereas 'gsub' replaces all
     occurrences.
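
     The difference between the two functions can be seen on a single
     string.  This is a minimal sketch; the input string is made up.

```r
s <- "foo boo"   # hypothetical input

first_only <- sub("o", "0", s)    # only the first "o" is replaced: "f0o boo"
every_one  <- gsub("o", "0", s)   # every "o" is replaced:          "f00 b00"
```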

     For 'regexpr' it is an error for 'pattern' to be 'NA'; in the
     other functions 'NA' is permitted and matches only itself.

     The regular expressions used are those specified by POSIX 1003.2,
     either extended or basic, depending on the value of the 'extended'
     argument, unless 'perl = TRUE' when they are those of PCRE, <URL:
     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/>. (The
     exact set of patterns supported may depend on the version of PCRE
     installed on the system in use.)
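
     To make the choice of regular-expression flavour concrete, the
     sketch below uses a POSIX character class, which works in either
     mode, and a lookahead assertion, which is PCRE syntax and so needs
     'perl = TRUE'.  The sample vectors are assumptions, and the
     example presumes the installed PCRE supports lookahead.

```r
# POSIX bracket class: accepted with or without perl = TRUE.
ids <- c("123", "12a")                       # hypothetical data
all_digits <- grep("^[[:digit:]]+$", ids)    # 1

# Lookahead "(?=...)" is Perl-style syntax: match "foo" only when
# it is immediately followed by "bar", without consuming "bar".
x <- c("foobar", "foobaz")                   # hypothetical data
idx <- grep("foo(?=bar)", x, perl = TRUE)           # 1
replaced <- sub("foo(?=bar)", "X", x, perl = TRUE)  # "Xbar" "foobaz"
```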

     'useBytes' is only used if 'fixed = TRUE' or 'perl = TRUE'. For
     'grep' its main effect is to avoid errors/warnings about invalid
     inputs, but for 'regexpr' it changes the interpretation of the
     output.
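
     The effect of 'useBytes' on the output of 'regexpr' can be
     sketched with a string containing one two-byte character.  The
     example assumes the string is stored as UTF-8 (which the "\u"
     escape arranges); the string itself is made up.

```r
# "\u00e9" (e with acute accent) occupies two bytes in UTF-8.
x <- "caf\u00e9 au lait"   # hypothetical input

# Position counted in characters: "au" starts at character 6.
pos_chars <- as.integer(regexpr("au", x, fixed = TRUE))

# Position counted in bytes: the accented letter adds one extra byte,
# so "au" starts at byte 7.
pos_bytes <- as.integer(regexpr("au", x, fixed = TRUE, useBytes = TRUE))
```

     Positions obtained with 'useBytes = TRUE' should therefore not be
     passed to character-based functions such as 'substring' without
     care.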

_V_a_l_u_e:

     For 'grep' a vector giving either the indices of the elements of
     'x' that yielded a match or, if 'value' is 'TRUE', the matched
     elements.
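
     The two return forms of 'grep' can be compared side by side.  This
     is a small sketch on made-up data.

```r
fruits <- c("apple", "banana", "cherry")   # hypothetical data

idx  <- grep("an", fruits)                 # indices of matches: 2
hits <- grep("an", fruits, value = TRUE)   # the matching elements: "banana"

# With no match, the index form is a zero-length integer vector.
none <- grep("zz", fruits)
```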

     For 'sub' and 'gsub' a character vector of the same length as
     'x'.

     For 'regexpr' an integer vector of the same length as 'text'
     giving the starting position of the first match, or -1 if there is
     none, with attribute '"match.length"' giving the length of the
     matched text (or -1 for no match).  In a multi-byte locale these
     quantities are in characters rather than bytes unless 'useBytes =
     TRUE' is used with 'fixed = TRUE' or 'perl = TRUE'.
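
     One common use of this return value is to extract the matched
     text.  The following sketch combines the start positions with the
     '"match.length"' attribute and 'substring'; the input vector is
     an assumption for illustration.

```r
x <- c("abc123def", "no digits", "7up")   # hypothetical data

m <- regexpr("[0-9]+", x)
start <- as.integer(m)              # 4 -1 1   (-1 means no match)
len   <- attr(m, "match.length")    # 3 -1 1

# Keep only the elements that matched, then cut out the matched text.
found <- start > 0
matched <- substring(x[found], start[found], start[found] + len[found] - 1)
# matched is "123" "7"
```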

     If in a multi-byte locale the pattern or replacement is not a
     valid sequence of bytes, an error is thrown.  An invalid string in
     'x' or 'text' is a non-match with a warning for 'grep' or
     'regexpr', but an error for 'sub' or 'gsub'.

_W_a_r_n_i_n_g:

     The standard regular-expression code has been reported to be very
     slow when applied to extremely long character strings (tens of
     thousands of characters or more): the code used when 'perl = TRUE'
     seems much faster and more reliable for such usages.
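
     Whether this matters on a given system can be checked with a
     quick timing comparison.  The sketch below is only a rough
     probe: the test string is made up, and the timings depend
     entirely on the R version and platform, so no particular result
     is implied.

```r
# Build one long string of 100,000 characters (hypothetical test input).
long <- paste(rep("ab   ", 20000), collapse = "")

# Time the same substitution with the standard and the PCRE code.
t_default <- system.time(r_default <- gsub(" +", "_", long))
t_perl    <- system.time(r_perl    <- gsub(" +", "_", long, perl = TRUE))

# Both engines should agree on the result for this simple pattern.
same <- identical(r_default, r_perl)

# Elapsed seconds for each variant.
print(c(default = t_default[["elapsed"]], perl = t_perl[["elapsed"]]))
```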

     The standard version of 'gsub' does not correctly substitute
     repeated word boundaries (e.g. 'pattern = "\b"').  Use 'perl =
     TRUE' for such matches.
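
     As a small sketch of the word-boundary case, the call below marks
     every boundary in a made-up ASCII string using 'perl = TRUE'.

```r
s <- "the quick fox"   # hypothetical input

# "\\b" matches the empty string at each edge of a word.
marked <- gsub("\\b", "|", s, perl = TRUE)
# marked is "|the| |quick| |fox|"
```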

     The 'perl = TRUE' option is only implemented for single-byte and
     UTF-8 encodings, and will warn if used in a non-UTF-8 multi-byte
     locale (unless 'useBytes = TRUE').

_R_e_f_e_r_e_n_c_e_s:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_. Wadsworth & Brooks/Cole ('grep')

_S_e_e _A_l_s_o:

     regular expression (aka 'regexp') for the details of the pattern
     specification.

     'agrep' for approximate matching.

     'tolower', 'toupper' and 'chartr' for character translations.
     'charmatch', 'pmatch', 'match'. 'apropos' uses regexps and has
     nice examples.

_E_x_a_m_p_l_e_s:

     grep("[a-z]", letters)

     txt <- c("arm","foot","lefroo", "bafoobar")
     if(length(i <- grep("foo", txt)) > 0)
        cat("'foo' appears at least once in\n\t", txt, "\n")
     i # 2 and 4
     txt[i]

     ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
     gsub("([ab])", "\\1_\\1_", "abc and ABC")

     txt <- c("The", "licenses", "for", "most", "software", "are",
       "designed", "to", "take", "away", "your", "freedom",
       "to", "share", "and", "change", "it.",
        "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
        "is", "intended", "to", "guarantee", "your", "freedom", "to",
        "share", "and", "change", "free", "software", "--",
        "to", "make", "sure", "the", "software", "is",
        "free", "for", "all", "its", "users")
     ( i <- grep("[gu]", txt) ) # indices
     stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

     ## Note that in locales such as en_US this includes B as the
     ## collation order is aAbBcCdDeE ...
     (ot <- sub("[b-e]",".", txt))
     txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution

     txt[gsub("g","#", txt) !=
         gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

     regexpr("en", txt)

     ## trim trailing white space
     str <- 'Now is the time      '
     sub(' +$', '', str)  ## spaces only
     sub('[[:space:]]+$', '', str) ## white space, POSIX-style
     sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

