factor                 package:base                 R Documentation

_F_a_c_t_o_r_s

_D_e_s_c_r_i_p_t_i_o_n:

     The function 'factor' is used to encode a vector as a factor (the
     terms 'category' and 'enumerated type' are also used for factors).
     If 'ordered' is 'TRUE', the factor levels are assumed to be
     ordered. For compatibility with S there is also a function
     'ordered'.

     'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the
     membership and coercion functions for these classes.

_U_s_a_g_e:

     factor(x = character(),
            levels = sort(unique.default(x), na.last = TRUE),
            labels = levels, exclude = NA, ordered = is.ordered(x))

     ordered(x, ...)

     is.factor(x)
     is.ordered(x)

     as.factor(x)
     as.ordered(x)

     addNA(x, ifany=FALSE)

_A_r_g_u_m_e_n_t_s:

       x: a vector of data, usually taking a small number of distinct
          values.

  levels: an optional vector of the values that 'x' might have taken.
          The default is the set of values taken by 'x', sorted into
          increasing order.

  labels: _either_ an optional vector of labels for the levels (in the
          same order as 'levels' after removing those in 'exclude'),
          _or_ a character string of length 1.

 exclude: a vector of values to be excluded when forming the set of
          levels. This should be of the same type as 'x', and will be
          coerced if necessary.

 ordered: logical flag to determine if the levels should be regarded as
          ordered (in the order given).

     ...: (in 'ordered(.)'): any of the above, apart from 'ordered'
          itself.

   ifany: (in 'addNA'): Only add an 'NA' level if it is used, i.e. if
          'any(is.na(x))'.

_D_e_t_a_i_l_s:

     The type of the vector 'x' is not restricted.

     Ordered factors differ from factors only in their class, but
     methods and the model-fitting functions treat the two classes
     quite differently.

     The encoding of the vector happens as follows. First all the
     values in 'exclude' are removed from 'levels'. If 'x[i]' equals
     'levels[j]', then the 'i'-th element of the result is 'j'.  If no
     match is found for 'x[i]' in 'levels', then the 'i'-th element of
     the result is set to 'NA'.
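
     As a minimal sketch of the matching rule above (the vectors 'x'
     and 'lev' are made-up illustrations, not part of the original
     page), the integer codes can be compared with what 'match'
     returns:

```r
# Hypothetical data: 'x' and 'lev' are illustrative names only.
x   <- c("b", "a", "c", "a", "z")
lev <- c("a", "b", "c")

f <- factor(x, levels = lev)

# Each code is the position of x[i] in 'lev';
# "z" matches no level, so its code is NA.
codes <- as.integer(f)
codes                              # 2 1 3 1 NA
identical(codes, match(x, lev))    # TRUE
```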

     Normally the 'levels' used as an attribute of the result are the
     reduced set of levels after removing those in 'exclude', but this
     can be altered by supplying 'labels'. This should either be a set
     of new labels for the levels, or a character string, in which case
     the levels are that character string with a sequence number
     appended.
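
     The two forms of 'labels' might be sketched as follows; the data
     and the label strings are assumptions chosen only for
     illustration:

```r
# Hypothetical data and labels, for illustration only.
x <- c("lo", "hi", "mid", "hi")

# One label per level, given in the same order as 'levels'.
g <- factor(x, levels = c("lo", "mid", "hi"),
            labels = c("Low", "Medium", "High"))
levels(g)          # "Low" "Medium" "High"
as.character(g)    # "Low" "High" "Medium" "High"

# A single string: a sequence number is appended for each level.
h <- factor(x, levels = c("lo", "mid", "hi"), labels = "grp")
levels(h)          # "grp1" "grp2" "grp3"
```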

     'factor(x, exclude=NULL)' applied to a factor is a no-operation
     unless there are unused levels: in that case, a factor with the
     reduced level set is returned.  If 'exclude' is used it should
     also be a factor with the same level set as 'x' or a set of codes
     for the levels to be excluded.

     The codes of a factor may contain 'NA'.  For a numeric 'x', set
     'exclude=NULL' to make 'NA' an extra level (prints as '<NA>'); by
     default, this is the last level.
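
     A short sketch of the effect of 'exclude = NULL' on a numeric
     vector (the vector 'v' is an invented example):

```r
# Hypothetical numeric data containing a missing value.
v <- c(3, 1, NA, 1)

f_default <- factor(v)                  # NA is excluded by default
levels(f_default)                       # "1" "3"
is.na(f_default)                        # FALSE FALSE TRUE FALSE

f_na <- factor(v, exclude = NULL)       # NA becomes the last level
levels(f_na)                            # "1" "3" NA
as.integer(f_na)                        # 2 1 3 1
is.na(f_na)                             # all FALSE: no code is missing
```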

     If 'NA' is a level, the way to set a code to be missing (as
     opposed to the code of the missing level) is to use 'is.na' on the
     left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';
     indexing inside 'is.na' does not work). Under those circumstances
     missing values are currently printed as '<NA>', i.e., identical to
     entries of level 'NA'.

     'is.factor' is generic: you can write methods to handle specific
     classes of objects, see InternalMethods.

_V_a_l_u_e:

     'factor' returns an object of class '"factor"' which has a set of
     integer codes of the same length as 'x' with a '"levels"'
     attribute of mode 'character'.  If 'ordered' is true (or 'ordered'
     is used) the result has class 'c("ordered", "factor")'.

     Applying 'factor' to an ordered or unordered factor returns a
     factor (of the same type) with just the levels which occur: see
     also '[.factor' for a more transparent way to achieve this.

     'is.factor' returns 'TRUE' or 'FALSE' depending on whether its
     argument inherits from class '"factor"' or not.  Correspondingly,
     'is.ordered' returns 'TRUE' when its argument is an ordered factor
     and 'FALSE' otherwise.

     'as.factor' coerces its argument to a factor. It is an abbreviated
     form of 'factor'.

     'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'
     otherwise.

     'addNA' modifies a factor by turning 'NA' into an extra level (so
     that 'NA' values are counted in tables, for instance).
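
     A small sketch of how 'addNA' changes tabulation, using invented
     data rather than a packaged data set:

```r
# Hypothetical factor with one missing value.
f <- factor(c("a", NA, "b", "a"))

counts_plain <- as.vector(table(f))          # NA is not counted
counts_plain                                 # 2 1

f_na <- addNA(f)
levels(f_na)                                 # "a" "b" NA
counts_na <- as.vector(table(f_na))          # NA now has its own cell
counts_na                                    # 2 1 1

# With ifany = TRUE, no NA level is added when nothing is missing.
levels(addNA(factor(c("a", "b")), ifany = TRUE))   # "a" "b"
```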

_W_a_r_n_i_n_g:

     The interpretation of a factor depends on both the codes and the
     '"levels"' attribute.  Be careful to compare only factors with the
     same set of levels (in the same order).  In particular,
     'as.numeric' applied to a factor is meaningless, and may happen by
     implicit coercion.  To transform a factor 'f' to its original
     numeric values, 'as.numeric(levels(f))[f]' is recommended and
     slightly more efficient than 'as.numeric(as.character(f))'.
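
     The difference between the codes and the original numeric values
     can be seen in a sketch like the following (the values in 'f' are
     made up for illustration):

```r
# Hypothetical factor built from numeric data.
f <- factor(c(10, 2.5, 10, 7))
levels(f)                               # "2.5" "7" "10"

# as.numeric() returns the internal codes, not the original values.
codes <- as.numeric(f)
codes                                   # 3 1 3 2

# Recommended way to recover the original numbers.
recovered <- as.numeric(levels(f))[f]
recovered                               # 10.0 2.5 10.0 7.0

# Equivalent result, converting every element rather than each level.
recovered2 <- as.numeric(as.character(f))
```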

     The levels of a factor are by default sorted, but the sort order
     may well depend on the locale at the time of creation, and should
     not be assumed to be ASCII.

     There are some anomalies associated with factors that have 'NA' as
     a level.  It is suggested to use them sparingly, e.g., only for
     tabulation purposes.

_C_o_m_p_a_r_i_s_o_n _o_p_e_r_a_t_o_r_s _a_n_d _g_r_o_u_p _g_e_n_e_r_i_c _m_e_t_h_o_d_s:

     There are '"factor"' and '"ordered"' methods for the group generic
     'Ops', which provide methods for the Comparison operators.  (The
     rest of the group and the 'Math' and 'Summary' groups generate an
     error as they are not meaningful for factors.)

     Only '==' and '!=' can be used for factors: a factor can only be
     compared to another factor with an identical set of levels (not
     necessarily in the same ordering) or to a character vector.
     Ordered factors are compared in the same way, but the general
     dispatch mechanism precludes comparing ordered and unordered
     factors.

     All the comparison operators are available for ordered factors.
     Comparison follows the order of the levels of the operands: if
     both operands are ordered factors they must have the same level
     set.
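
     The comparison rules above might be illustrated as follows; all
     object names and values here are assumptions for the sake of the
     sketch:

```r
# Hypothetical unordered factors with the same level set in a
# different order: '==' compares the labels, not the codes.
u1 <- factor(c("x", "y"), levels = c("x", "y"))
u2 <- factor(c("x", "y"), levels = c("y", "x"))
eq_factors <- u1 == u2
eq_factors                      # TRUE TRUE

# An unordered factor can also be compared to a character vector.
eq_char <- u1 == "x"
eq_char                         # TRUE FALSE

# Ordered factors support the full set of comparison operators,
# using the order of the levels rather than alphabetical order.
sizes <- ordered(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))
lt_large  <- sizes < "large"
lt_large                        # TRUE FALSE TRUE
ge_medium <- sizes >= "medium"
ge_medium                       # FALSE TRUE TRUE
```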

_N_o_t_e:

     Storing character data as a factor uses storage more efficiently
     if there is even a small proportion of repeats.  On a 32-bit
     machine storing a string of n bytes takes 28 + 8*ceiling((n+1)/8)
     bytes whereas storing a factor code takes 4 bytes.  (On a 64-bit
     machine 28 is replaced by 56 or more.)  Only if they were computed
     from the same values (or in some cases read from a file: see
     'scan') will identical strings share storage.
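
     As a worked instance of the 32-bit formula quoted above, the
     helper below (a hypothetical name, not part of R) simply evaluates
     the expression; it does not measure actual memory use:

```r
# Hypothetical helper: evaluates the 32-bit formula from the note.
string_bytes_32bit <- function(n) 28 + 8 * ceiling((n + 1) / 8)

# A 10-byte string: 28 + 8 * ceiling(11 / 8) = 28 + 16.
n_bytes <- string_bytes_32bit(10)
n_bytes                         # 44

# The same string stored as a factor code takes 4 bytes per element.
code_bytes <- 4
n_bytes / code_bytes            # 11
```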

_R_e_f_e_r_e_n_c_e_s:

     Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in
     S_. Wadsworth & Brooks/Cole.

_S_e_e _A_l_s_o:

     '[.factor' for subsetting of factors.

     'gl' for construction of balanced factors and 'C' for factors with
     specified contrasts. 'levels' and 'nlevels' for accessing the
     levels, and 'unclass' to get integer codes.

_E_x_a_m_p_l_e_s:

     (ff <- factor(substring("statistics", 1:10, 1:10), levels=letters))
     as.integer(ff)  # the internal codes
     factor(ff)      # drops the levels that do not occur
     ff[, drop=TRUE] # the same, more transparently

     factor(letters[1:20], labels="letter")

     class(ordered(4:1)) # "ordered", inheriting from "factor"

     ## Suppose you want NA as a level, and also to allow missing values.
     (x <- factor(c(1, 2, NA), exclude = NULL))
     is.na(x)[2] <- TRUE
     x  # [1] 1    <NA> <NA>
     is.na(x)
     # [1] FALSE  TRUE FALSE

     ## Using addNA()
     Month <- airquality$Month
     table(addNA(Month))
     table(addNA(Month, ifany=TRUE))

