cut                   package:base                   R Documentation

_C_o_n_v_e_r_t _N_u_m_e_r_i_c _t_o _F_a_c_t_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     'cut' divides the range of 'x' into intervals and codes the values
     in 'x' according to the interval into which they fall. The
     leftmost interval corresponds to level one, the next leftmost to
     level two and so on.

_U_s_a_g_e:

     cut(x, ...)

     ## Default S3 method:
     cut(x, breaks, labels = NULL,
         include.lowest = FALSE, right = TRUE, dig.lab = 3,
         ordered_result = FALSE, ...)

_A_r_g_u_m_e_n_t_s:

       x: a numeric vector which is to be converted to a factor by
          cutting.

  breaks: either a numeric vector of two or more cut points or a single
          number (greater than or equal to 2) giving the number of
          intervals into which 'x' is to be cut.

  labels: labels for the levels of the resulting category.  By default,
          labels are constructed using '"(a,b]"' interval notation. If
          'labels = FALSE', simple integer codes are returned instead
          of a factor.

include.lowest: logical, indicating if an 'x[i]' equal to the lowest
          (or highest, for 'right = FALSE') 'breaks' value should be
          included.

   right: logical, indicating if the intervals should be closed on the
          right (and open on the left) or vice versa.

 dig.lab: integer which is used when labels are not given. It
          determines the number of digits used in formatting the break
          numbers.

ordered_result: logical: should the result be an ordered factor?

     ...: further arguments passed to or from other methods.

_D_e_t_a_i_l_s:

     When 'breaks' is specified as a single number, the range of the
     data is divided into 'breaks' pieces of equal length, and then the
     outer limits are moved away by 0.1% of the range to ensure that
     the extreme values both fall within the break intervals.  (If 'x'
     is a constant vector, equal-length intervals are created that
     cover the single value.)

     If a 'labels' parameter is specified, its values are used to name
     the factor levels. If none is specified, the factor level labels
     are constructed as '"(b1,b2]"', '"(b2,b3]"' etc. for 'right =
     TRUE' and as '"[b1,b2)"', ... if 'right = FALSE'. In this case,
     'dig.lab' indicates the minimum number of digits that should be
     used in formatting the numbers 'b1', 'b2', .... A larger value (up
     to 12) will be used if needed to distinguish between any pair of
     endpoints; if this fails, labels such as '"Range3"' will be used.

_V_a_l_u_e:

     A 'factor' is returned, unless 'labels = FALSE', in which case
     only the integer level codes are returned.

_N_o_t_e:

     Instead of 'table(cut(x, br))', 'hist(x, br, plot = FALSE)' is
     more efficient and less memory hungry.  Instead of 'cut(*, labels
     = FALSE)', 'findInterval()' is more efficient.

_R_e_f_e_r_e_n_c_e_s:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_. Wadsworth & Brooks/Cole.

_S_e_e _A_l_s_o:

     'split' for splitting a variable according to a group factor;
     'factor', 'tabulate', 'table', 'findInterval()'.

     'quantile' for ways of choosing breaks of roughly equal content
     (rather than length), 'cut2' in package 'Hmisc' for a canned way
     to form quantile groups.

_E_x_a_m_p_l_e_s:

     Z <- stats::rnorm(10000)
     table(cut(Z, breaks = -6:6))
     sum(table(cut(Z, breaks = -6:6, labels=FALSE)))
     sum(graphics::hist(Z, breaks = -6:6, plot=FALSE)$counts)
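
     ## A hedged sketch of the Note above, on a small fixed vector ('zs' and
     ## 'brs' are illustrative names): for right-closed intervals the integer
     ## codes from cut(*, labels = FALSE) should agree with findInterval()
     ## called with left.open = TRUE (assumes R >= 3.3.0, where that argument
     ## was added), as long as every value lies inside the breaks.
     zs  <- c(-2.5, -0.3, 0, 0.7, 1, 2.9)
     brs <- -3:3
     codes_cut <- cut(zs, breaks = brs, labels = FALSE)
     codes_fi  <- findInterval(zs, brs, left.open = TRUE)
     stopifnot(all(codes_cut == codes_fi))
     ## values outside the breaks differ: cut() gives NA, findInterval() does not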

     cut(rep(1, 5), 4)  #-- dummy
     tx0 <- c(9, 4, 6, 5, 3, 10, 5, 3, 5)
     x <- rep(0:8, tx0)
     stopifnot(table(x) == tx0)

     table( cut(x, breaks = 8))
     table( cut(x, breaks = 3*(-2:5)))
     table( cut(x, breaks = 3*(-2:5), right = FALSE))

     ##--- some values OUTSIDE the breaks :
     table(cx  <- cut(x, breaks = 2*(0:4)))
     table(cxl <- cut(x, breaks = 2*(0:4), right = FALSE))
     which(is.na(cx));  x[is.na(cx)]  #-- the first 9  values  0
     which(is.na(cxl)); x[is.na(cxl)] #-- the last  5  values  8
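
     ## A minimal sketch: include.lowest = TRUE closes the first interval on
     ## the left, so values equal to the lowest break are kept rather than NA.
     ## 'xs' repeats the data in 'x' above so the lines run on their own.
     xs  <- rep(0:8, c(9, 4, 6, 5, 3, 10, 5, 3, 5))
     cxi <- cut(xs, breaks = 2*(0:4), include.lowest = TRUE)
     levels(cxi)      # the first level is now closed on both sides
     sum(is.na(cxi))  # no value falls outside the intervals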

     ## Label construction:
     y <- stats::rnorm(100)
     table(cut(y, breaks = pi/3*(-3:3)))
     table(cut(y, breaks = pi/3*(-3:3), dig.lab=4))

     table(cut(y, breaks =  1*(-3:3), dig.lab=4))
     # extra digits don't "harm" here
     table(cut(y, breaks =  1*(-3:3), right = FALSE))
     #- the same, since no exact INT!
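
     ## A sketch of user-supplied labels; the data and names ('ages', 'grp')
     ## are illustrative only. One label is needed per interval, i.e.
     ## length(breaks) - 1 labels; ordered_result = TRUE makes the levels
     ## comparable with < and >.
     ages <- c(3, 17, 18, 40, 65, 80)
     grp  <- cut(ages, breaks = c(0, 17, 64, Inf),
                 labels = c("minor", "adult", "senior"),
                 ordered_result = TRUE)
     table(grp)
     grp < "senior"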

     ## sometimes the default dig.lab is not enough to avoid confusion:
     aaa <- c(1,2,3,4,5,2,3,4,5,6,7)
     cut(aaa, 3)
     cut(aaa, 3, dig.lab=4, ordered_result = TRUE)

     ## one way to extract the breakpoints
     labs <- levels(cut(aaa, 3))
     cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs) ),
           upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) ))
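
     ## another way, sketched under the description in Details: build the
     ## breaks explicitly and pass them to cut(), so no label parsing is
     ## needed. 'rng' and 'brk' are illustrative names; the interior breaks
     ## chosen by cut(aaa, 3) itself may differ slightly between R versions.
     aaa <- c(1,2,3,4,5,2,3,4,5,6,7)   # same data as above
     rng <- range(aaa)
     brk <- seq(rng[1], rng[2], length.out = 3 + 1)
     brk[c(1, 4)] <- brk[c(1, 4)] + c(-1, 1) * diff(rng) / 1000
     cbind(lower = brk[-length(brk)], upper = brk[-1])
     table(cut(aaa, breaks = brk))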

