clara                package:cluster                R Documentation

_C_l_u_s_t_e_r_i_n_g _L_a_r_g_e _A_p_p_l_i_c_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Computes a '"clara"' object, a list representing a clustering of
     the data into 'k' clusters.

_U_s_a_g_e:

     clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
           sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
           keep.data = medoids.x, rngR = FALSE)

_A_r_g_u_m_e_n_t_s:

       x: data matrix or data frame, each row corresponds to an
          observation, and each column corresponds to a variable.  All
          variables must be numeric. Missing values (NAs) are allowed.

       k: integer, the number of clusters. It is required that 0 < k <
          n where n is the number of observations (i.e., n =
          'nrow(x)').

  metric: character string specifying the metric to be used for
          calculating dissimilarities between observations. The
          currently available options are "euclidean" and "manhattan".
          Euclidean distances are root sum-of-squares of differences,
          and manhattan distances are the sum of absolute differences. 

   stand: logical, indicating if the measurements in 'x' are
          standardized before calculating the dissimilarities. 
          Measurements are standardized for each variable (column), by
          subtracting the variable's mean value and dividing by the
          variable's mean absolute deviation. 

 samples: integer, number of samples to be drawn from the dataset.

sampsize: integer, number of observations in each sample. 'sampsize'
          should be higher than the number of clusters ('k') and at
          most the number of observations (n = 'nrow(x)').

   trace: integer indicating a _trace level_ for diagnostic output
          during the algorithm.

medoids.x: logical indicating if the medoids should be returned as
          rows identical to some rows of the input data 'x'.  If
          'FALSE', 'keep.data' must be false as well; the medoid
          indices, i.e., the row numbers of the medoids, are still
          returned ('i.med' component), and the algorithm saves space
          by needing one copy less of 'x'.

keep.data: logical indicating if the (_scaled_ if 'stand' is true) data
          should be kept in the result. Setting this to 'FALSE' saves
          memory (and hence time), but the result can then no longer
          be plotted with 'clusplot()'.  Use 'medoids.x = FALSE' to
          save even more memory.

    rngR: logical indicating if R's random number generator should be
          used instead of the primitive clara()-builtin one.  If true,
          each call to 'clara()' returns a different result unless the
          seed is fixed, e.g., by 'set.seed()', though only slightly
          different in good situations.
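
     The 'stand' and 'metric' arguments can be illustrated by hand.  The
     following is a minimal sketch that takes the descriptions above
     literally; 'standardize_mad' is a hypothetical helper, not a
     function of the 'cluster' package, and missing values are only
     skipped via 'na.rm', which need not match what 'clara' does
     internally.

```r
# Hypothetical helper (not part of 'cluster'): standardize each column by
# subtracting its mean and dividing by its mean absolute deviation.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x, na.rm = TRUE), "-")
  mean_abs_dev <- colMeans(abs(centered), na.rm = TRUE)
  sweep(centered, 2, mean_abs_dev, "/")
}

x <- cbind(a = c(1, 2, 3, 4, 10), b = c(100, 300, 200, 500, 400))
z <- standardize_mad(x)

# Dissimilarity between the first two standardized observations:
d <- z[1, ] - z[2, ]
euclid <- sqrt(sum(d^2))  # "euclidean": root sum-of-squares of differences
manhat <- sum(abs(d))     # "manhattan": sum of absolute differences
```

     After standardization every column has mean 0 and mean absolute
     deviation 1, so no single variable dominates the dissimilarities
     merely because of its units.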

_D_e_t_a_i_l_s:

     'clara' is fully described in chapter 3 of Kaufman and Rousseeuw
     (1990). Compared to other partitioning methods such as 'pam', it
     can deal with much larger datasets.  Internally, this is achieved
     by considering sub-datasets of fixed size ('sampsize') such that
     the time and storage requirements become linear in n rather than
     quadratic.

     Each sub-dataset is partitioned into 'k' clusters using the same
     algorithm as in 'pam'.  Once 'k' representative objects have been
     selected from the sub-dataset, each observation of the entire
     dataset is assigned to the nearest medoid.

     The sum of the dissimilarities of the observations to their
     closest medoid is used as a measure of the quality of the
     clustering.  The sub-dataset for which this sum is minimal is
     retained.  A further analysis is carried out on the final
     partition.

     Each sub-dataset is forced to contain the medoids obtained from
     the best sub-dataset until then.  Randomly drawn observations are
     added to this set until 'sampsize' has been reached.
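
     The steps described above can be sketched in plain R.  This is an
     illustration only, not the package's compiled implementation:
     'clara_sketch' is a hypothetical name, it assumes complete numeric
     data and Euclidean distances, and it uses R's own random number
     generator for the sampling.

```r
library(cluster)  # for pam()

# Hypothetical re-implementation of the sampling scheme, for illustration.
clara_sketch <- function(x, k, samples = 5,
                         sampsize = min(nrow(x), 40 + 2 * k)) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(objective = Inf)
  for (s in seq_len(samples)) {
    # Force the best medoids found so far into the sample (none in the
    # first round), then fill up with randomly drawn observations.
    keep <- best$i.med
    rest <- setdiff(seq_len(n), keep)
    idx <- c(keep, rest[sample.int(length(rest), sampsize - length(keep))])
    # Partition the sub-dataset; 'id.med' indexes rows of the sample.
    fit <- pam(x[idx, , drop = FALSE], k)
    i.med <- idx[fit$id.med]
    # Distance of every observation in the full data to each medoid.
    d <- sapply(i.med, function(m) sqrt(rowSums(sweep(x, 2, x[m, ])^2)))
    clustering <- max.col(-d, ties.method = "first")  # nearest medoid
    objective <- sum(d[cbind(seq_len(n), clustering)])
    if (objective < best$objective)
      best <- list(i.med = i.med, clustering = clustering,
                   objective = objective)
  }
  best
}

set.seed(1)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)
```

     Only a 'sampsize' by 'sampsize' problem is handed to 'pam' in each
     round; the full data are touched only when assigning observations
     to the 'k' medoids.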

_V_a_l_u_e:

     an object of class '"clara"' representing the clustering.  See
     'clara.object' for details.

_N_o_t_e:

     By default, the random sampling is implemented with a _very_
     simple scheme (with period 2^{16} = 65536) inside the Fortran
     code, independently of R's random number generation; it is in
     fact deterministic.  We recommend setting 'rngR = TRUE' instead,
     which uses R's random number generators.  'clara()' results can
     then be made reproducible, typically by calling 'set.seed()'
     before calling 'clara'.
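
     A short check of this behaviour, using the 'xclara' data shipped
     with the package; the seed value 42 is arbitrary.

```r
library(cluster)
data(xclara)

# With rngR = TRUE, fixing the seed before each call makes runs repeatable.
set.seed(42)
a <- clara(xclara, 3, rngR = TRUE)
set.seed(42)
b <- clara(xclara, 3, rngR = TRUE)
identical(a$i.med, b$i.med)

# The built-in generator (rngR = FALSE, the default) is deterministic,
# so repeated calls agree without any seed.
c1 <- clara(xclara, 3)
c2 <- clara(xclara, 3)
identical(c1$i.med, c2$i.med)
```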

     The storage requirement of 'clara' computation (for small 'k') is
     about O(n * p) + O(j^2) where j = 'sampsize', and (n,p) =
     'dim(x)'. The CPU computing time (again assuming small 'k') is
     about O(n * p * j^2 * N), where N = 'samples'.
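
     To get a feeling for these orders of magnitude, the terms can be
     evaluated for a concrete case.  The figures below are rough
     assumptions for illustration (8 bytes per number, constant factors
     ignored, hypothetical values of 'n', 'p' and 'k'), not
     measurements of 'clara' itself.

```r
n <- 1e6   # observations (assumed)
p <- 10    # variables (assumed)
k <- 3     # clusters (assumed)
j <- min(n, 40 + 2 * k)   # default 'sampsize'
bytes <- 8                # size of one double

# O(n * p): one copy of the data, in MiB.
data_mib <- n * p * bytes / 2^20
# O(j^2): dissimilarities within one sample, in KiB.
sample_kib <- j * (j - 1) / 2 * bytes / 2^10
# For comparison: all pairwise dissimilarities of the full data, in TiB.
full_tib <- n * (n - 1) / 2 * bytes / 2^40

c(data_mib = data_mib, sample_kib = sample_kib, full_tib = full_tib)
```

     The sample term is negligible next to the data itself, whereas a
     full dissimilarity matrix would grow quadratically in n.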

     For "small" datasets, the function 'pam' can be used directly.
     What can be considered _small_ is really a function of the
     available computing power, both memory (RAM) and speed.
     Originally (1990), "small" meant less than 100 observations; in
     1997, the authors said _"small (say with fewer than 200
     observations)"_; as of 2006, you can use 'pam' with several
     thousand observations.

_A_u_t_h_o_r(_s):

     Kaufman and Rousseeuw (see 'agnes'), originally. All arguments
     from 'trace' on, and most R documentation and all tests by Martin
     Maechler.

_S_e_e _A_l_s_o:

     'agnes' for background and references; 'clara.object', 'pam',
     'partition.object', 'plot.partition'.

_E_x_a_m_p_l_e_s:

     ## generate 500 objects, divided into 2 clusters.
     x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
                cbind(rnorm(300,50,8), rnorm(300,50,8)))
     clarax <- clara(x, 2)
     clarax
     clarax$clusinfo
     plot(clarax)

     ## 'xclara' is an artificial data set with 3 clusters of 1000 bivariate
     ## objects each.
     data(xclara)
     (clx3 <- clara(xclara, 3))
     ## Plot similar to Figure 5 in Struyf et al (1996)
     ## Not run: plot(clx3, ask = TRUE)


     ## Try 100 times *different* random samples -- for reliability:
     nSim <- 100
     nCl <- 3 # = no.classes
     set.seed(421) # (reproducibility)
     cl <- matrix(NA, nrow(xclara), nSim)
     for(i in 1:nSim)
        ## use the full component name rather than partial matching of '$cluster'
        cl[,i] <- clara(xclara, nCl, medoids.x = FALSE, rngR = TRUE)$clustering
     tcl <- apply(cl, 1, tabulate, nbins = nCl)
     ## those that are not always in same cluster (5 out of 3000 for this seed):
     (iDoubt <- which(apply(tcl,2, function(n) all(n < nSim))))
     if(length(iDoubt)) { # (not for all seeds)
       tabD <- tcl[,iDoubt, drop=FALSE]
       dimnames(tabD) <- list(cluster = paste(1:nCl), obs = format(iDoubt))
       t(tabD) # how many times in which clusters
     }

