strsplit                package:base                R Documentation

_S_p_l_i_t _t_h_e _E_l_e_m_e_n_t_s _o_f _a _C_h_a_r_a_c_t_e_r _V_e_c_t_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     Split the elements of a character vector 'x' into substrings
     according to the presence of substring 'split' within them.

_U_s_a_g_e:

     strsplit(x, split, extended = TRUE, fixed = FALSE, perl = FALSE)

_A_r_g_u_m_e_n_t_s:

       x: character vector, each element of which is to be split. 
          Other inputs, including a factor, will give an error. 

   split: character vector (or object which can be coerced to such)
          containing regular expression(s) (unless 'fixed = TRUE') to
          use for splitting.  If empty matches occur, in particular if
          'split' has length 0, 'x' is split into single characters. If
          'split' has length greater than 1, it is recycled along 'x'.

extended: logical.  If 'TRUE' (the default), extended regular
          expression matching is used, and if 'FALSE' basic regular
          expressions are used. 

   fixed: logical.  If 'TRUE' match 'split' exactly, otherwise use
          regular expressions.  Has priority over 'perl' and
          'extended'. 

    perl: logical.  Should perl-compatible regexps be used? Has
          priority over 'extended'. 
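
     As a brief illustration of how 'split' is recycled along 'x', the
     sketch below pairs each element of 'x' with its own separator.  The
     input strings are made up for the example, and 'fixed = TRUE' is
     used so that "+" is treated as a literal character rather than as
     part of a regular expression.

```r
x <- c("a-b-c", "d+e", "f-g")

# 'split' has length 2, so it is recycled along 'x' as "-", "+", "-"
res <- strsplit(x, c("-", "+"), fixed = TRUE)
print(res)
```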

_D_e_t_a_i_l_s:

     Argument 'split' will be coerced to character, so you will see
     uses with 'split = NULL' to mean 'split = character(0)', including
     in the examples below.

     Note that splitting into single characters can be done _via_
     'split=character(0)' or 'split=""'; the two are equivalent. The
     definition of 'character' here depends on the locale (and perhaps
     OS): in a single-byte locale it is a byte, and in a multi-byte
     locale it is the unit represented by a 'wide character' (almost
     always a Unicode code point).

     A missing value of 'split' does not split the corresponding
     element(s) of 'x' at all.
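
     The following sketch illustrates the statements above on small,
     made-up inputs: 'NULL' and '""' give the same single-character
     split, and a missing 'split' returns the element unchanged.  The
     variable names are chosen for this illustration only.

```r
# NULL is coerced to character(0); both forms split into single characters
chars_null  <- strsplit("abc", NULL)[[1]]
chars_empty <- strsplit("abc", "")[[1]]
print(identical(chars_null, chars_empty))

# a missing value of 'split' leaves the corresponding element unsplit
na_res <- strsplit("a b c", NA_character_)[[1]]
print(na_res)
```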

     The algorithm applied to each input string is


         repeat {
             if the string is empty
                 break.
             if there is a match
                 add the string to the left of the match to the output.
                 remove the match and all to the left of it.
             else
                 add the string to the output.
                 break.
         }

     Note that this means that if there is a match at the beginning of
     a (non-empty) string, the first element of the output is '""', but
     if there is a match at the end of the string, the output is the
     same as with the match removed.
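
     The algorithm above can be sketched in R for a single string.  This
     is an illustration only, not the code 'strsplit' itself uses: the
     helper name 'split_one' is hypothetical, and the sketch assumes
     every match is non-empty (it stops on a zero-length match, which
     'strsplit' instead handles by splitting into single characters).

```r
# Hypothetical helper: apply the documented loop to one string 's'.
split_one <- function(s, pattern, fixed = FALSE) {
    out <- character(0)
    repeat {
        # if the string is empty, stop
        if (!nzchar(s)) break
        m <- regexpr(pattern, s, fixed = fixed)
        pos <- m[[1]]
        len <- attr(m, "match.length")[[1]]
        if (pos == -1L) {
            # no match: the remaining string is the last piece
            out <- c(out, s)
            break
        }
        # assumption of this sketch: matches are never empty
        if (len == 0L) stop("zero-length match not handled in this sketch")
        # add the text to the left of the match to the output
        out <- c(out, substr(s, 1L, pos - 1L))
        # remove the match and everything to the left of it
        s <- substr(s, pos + len, nchar(s))
    }
    out
}

lead_trail <- split_one("#a#", "#")
dots       <- split_one("a.b.c", ".", fixed = TRUE)
print(lead_trail)
print(dots)
```

     For these inputs the sketch agrees with 'strsplit': a match at the
     start yields a leading '""', while a match at the end adds nothing.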

_V_a_l_u_e:

     A list of length 'length(x)', the 'i'-th element of which contains
     the vector of splits of 'x[i]'.

     If 'fixed = TRUE' or 'perl = TRUE', and if any element of 'x' or
     'split' is declared to be in UTF-8 (see 'Encoding'), non-ASCII
     character strings in the result will be in UTF-8 and have the
     encoding declared as UTF-8.  Otherwise they will be in the current
     locale's encoding, and be declared to have the encoding of the
     current locale if that is either Latin-1 or UTF-8 and the
     corresponding input had a declared encoding.
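
     A short sketch of the shape of the result, using made-up inputs.
     The second part prints the declared encoding of the pieces of a
     string written with a '\u' escape; what it reports may depend on
     the platform and locale, so no particular output is claimed here.

```r
x <- c("a b", "c", "d e f")
res <- strsplit(x, " ")

# one list element per element of 'x'
print(length(res) == length(x))
# number of pieces obtained from each element
print(sapply(res, length))

# a string containing a non-ASCII character, split with fixed = TRUE
y <- "caf\u00e9 au lait"
parts <- strsplit(y, " ", fixed = TRUE)[[1]]
print(Encoding(parts))
```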

_W_a_r_n_i_n_g:

     The standard regular expression code has been reported to be very
     slow when applied to extremely long character strings (tens of
     thousands of characters or more): the code used when 'perl = TRUE'
     seems much faster and more reliable for such usages.

     The 'perl = TRUE' option is only implemented for single-byte and
     UTF-8 encodings, and will warn if used in a non-UTF-8 multibyte
     locale.
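
     As a sketch of the usage suggested above, a long string can be
     split with 'perl = TRUE'.  The test string is synthetic and no
     timing is claimed; the comparison with 'fixed = TRUE' only checks
     that both calls return the same pieces for this literal separator.

```r
# a synthetic string of roughly 150,000 characters
long <- paste(rep("ab", 50000), collapse = ",")

pieces_perl  <- strsplit(long, ",", perl = TRUE)[[1]]
pieces_fixed <- strsplit(long, ",", fixed = TRUE)[[1]]

print(length(pieces_perl))
print(identical(pieces_perl, pieces_fixed))
```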

_S_e_e _A_l_s_o:

     'paste' for the reverse, 'grep' and 'sub' for string search and
     manipulation; see also 'nchar' and 'substr'.

     'regular expression' for the details of the pattern specification.

_E_x_a_m_p_l_e_s:

     noquote(strsplit("A text I want to display with spaces", NULL)[[1]])

     x <- c(as = "asfef", qu = "qwerty", "yuiop[", "b", "stuff.blah.yech")
     # split x on the letter e
     strsplit(x, "e")

     unlist(strsplit("a.b.c", "."))
     ## [1] "" "" "" "" ""
     ## Note that 'split' is a regexp!
     ## If you really want to split on '.', use
     unlist(strsplit("a.b.c", "\\."))
     ## [1] "a" "b" "c"
     ## or
     unlist(strsplit("a.b.c", ".", fixed = TRUE))

     ## a useful function: rev() for strings
     strReverse <- function(x)
             sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
     strReverse(c("abc", "Statistics"))

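     ## get the first names of the members of R-core
     ## (the first 8 lines of the AUTHORS file are skipped)
     a <- readLines(file.path(R.home("doc"),"AUTHORS"))[-(1:8)]
     ## drop the last three lines: the indices are -length(a) + 0:2
     a <- a[(0:2)-length(a)]
     ## keep only the text before the first space
     (a <- sub(" .*","", a))
     # and reverse them
     strReverse(a)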

     ## Note that final empty strings are not produced:
     strsplit(paste(c("", "a", ""), collapse="#"), split="#")[[1]]
     # [1] ""  "a"
     ## and also an empty string is only produced before a definite match:
     strsplit("", " ")[[1]]    # character(0)
     strsplit(" ", " ")[[1]]   # [1] ""

