silhouette              package:cluster              R Documentation

_C_o_m_p_u_t_e _o_r _E_x_t_r_a_c_t _S_i_l_h_o_u_e_t_t_e _I_n_f_o_r_m_a_t_i_o_n _f_r_o_m _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Compute silhouette information according to a given clustering in
     k clusters.

_U_s_a_g_e:

     silhouette(x, ...)
     ## Default S3 method:
     silhouette(x, dist, dmatrix, ...)
     ## S3 method for class 'partition':
     silhouette(x, ...)
     ## S3 method for class 'clara':
     silhouette(x, full = FALSE, ...)

     sortSilhouette(object, ...)
     ## S3 method for class 'silhouette':
     summary(object, FUN = mean, ...)
     ## S3 method for class 'silhouette':
     plot(x, nmax.lab = 40, max.strlen = 5,
          main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
          col = "gray",  do.col.sort = length(col) > 1, border = 0,
          cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)

_A_r_g_u_m_e_n_t_s:

       x: an object of appropriate class; for the 'default' method an
          integer vector with k different integer cluster codes or a
          list with such an 'x$clustering' component.  Note that
          silhouette statistics are only defined if 2 <= k <= n-1.

    dist: a dissimilarity object inheriting from class 'dist' or
          coercible to one.  If not specified, 'dmatrix' must be.

 dmatrix: a symmetric dissimilarity matrix (n * n), specified instead
          of 'dist', which can be more efficient.

    full: logical specifying if a _full_ silhouette should be computed
          for a 'clara' object.  Note that this requires O(n^2) memory,
          since the full dissimilarity (see 'daisy') is needed
          internally.

  object: an object of class 'silhouette'.

     ...: further arguments passed to and from methods.

     FUN: function used to summarize silhouette widths.

nmax.lab: integer indicating the number of labels which is considered
          too large for single-name labeling of the silhouette plot.

max.strlen: positive integer giving the length to which strings are
          truncated in silhouette plot labeling.

main, sub, xlab: arguments to 'title'; have a sensible non-NULL default
          here.

col, border, cex.names: arguments passed to 'barplot()'; note that the
          default used to be 'col = heat.colors(n), border = par("fg")'
          instead.
          'col' can also be a color vector of length k for clusterwise
          coloring; see also 'do.col.sort'.

do.col.sort: logical indicating if the colors 'col' should be sorted
          "along" the silhouette; this is useful for casewise or
          clusterwise coloring.

  do.n.k: logical indicating if n and k "title text" should be written.

do.clus.stat: logical indicating if cluster sizes and average silhouette
          widths should be written to the right of the silhouettes.

_D_e_t_a_i_l_s:

     For each observation i, the _silhouette width_ s(i) is defined as
     follows:
     Put a(i) = average dissimilarity between i and all other points
     of the cluster to which i belongs (if i is the _only_ observation
     in its cluster, s(i) := 0 without further calculations).  For all
     _other_ clusters C, put d(i,C) = average dissimilarity of i to all
     observations of C.  The smallest of these d(i,C) is b(i) := min_C
     d(i,C), and can be seen as the dissimilarity between i and its
     "neighbor" cluster, i.e., the nearest one to which it does _not_
     belong.  Finally,

             s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).


     'silhouette.default()' is now based on C code donated by Romain
     Francois (the R version being still available as
     'cluster:::silhouette.default.R').

     Observations with a large s(i) (almost 1) are very well clustered;
     a small s(i) (around 0) means that the observation lies between
     two clusters; and observations with a negative s(i) are probably
     placed in the wrong cluster.
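
     The definition above can be checked by hand on a small case.  The
     following is a minimal sketch, not the package's implementation:
     the toy data 'x', the cluster codes 'cl' and the helper
     'sil_by_hand()' are made up for illustration, and the result is
     compared with 'silhouette()' called with the 'dmatrix' argument.

```r
library(cluster)

## Hypothetical toy data: six points on a line in three clusters;
## cluster 3 is a singleton.
x  <- c(1, 2, 3, 10, 11, 30)
cl <- c(1L, 1L, 1L, 2L, 2L, 3L)
D  <- as.matrix(dist(x))        # symmetric n x n dissimilarity matrix

## Illustrative helper (not part of 'cluster'): s(i) from the definition
sil_by_hand <- function(D, cl) {
    n <- length(cl)
    s <- numeric(n)
    for (i in seq_len(n)) {
        own <- cl == cl[i]
        if (sum(own) == 1L) next    # only observation in its cluster: s(i) := 0
        ## a(i): average over the *other* members; D[i, i] = 0 adds nothing
        a <- sum(D[i, own]) / (sum(own) - 1L)
        ## d(i, C) for every other cluster C, then b(i) = min_C d(i, C)
        d.iC <- tapply(D[i, !own], cl[!own], mean)
        b <- min(d.iC)
        s[i] <- (b - a) / max(a, b)
    }
    s
}

s.hand <- sil_by_hand(D, cl)
s.pkg  <- silhouette(cl, dmatrix = D)[, "sil_width"]
all.equal(s.hand, unname(s.pkg))
```

     For the first point, a(1) = (1 + 2)/2 = 1.5 and b(1) = (9 + 10)/2
     = 9.5, so s(1) = 8/9.5.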

_V_a_l_u_e:

     'silhouette()' returns an object, 'sil', of class 'silhouette'
     which is an [n x 3] matrix with attributes.  For each observation
     i, 'sil[i,]' contains the cluster to which i belongs as well as
     the neighbor cluster of i (the cluster, not containing i, for
     which the average dissimilarity between its observations and i is
     minimal), and the silhouette width s(i) of the observation.  The
     'colnames' correspondingly are 'c("cluster", "neighbor",
     "sil_width")'.

     'summary(sil)' returns an object of class 'summary.silhouette', a
     list with components 

si.summary: numerical 'summary' of the individual silhouette widths
          s(i).

clus.avg.widths: numeric (rank 1) array of clusterwise _means_ of
          silhouette widths where 'mean = FUN' is used.

avg.width: the total mean 'FUN(s)' where 's' are the individual
          silhouette widths.

clus.sizes: 'table' of the k cluster sizes.

    call: if available, the call creating 'sil'.

 Ordered: logical identical to 'attr(sil, "Ordered")', see below.
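
     As an illustration of these components, the sketch below uses
     'avg.width' to compare PAM clusterings of the 'ruspini' data for
     several k, and passes 'FUN = median' to get clusterwise medians
     instead of means.  The object names 'ks', 'avg' and 'ssi.med' are
     chosen here for illustration only, and picking k by the largest
     average width is one common heuristic, not a rule.

```r
library(cluster)
data(ruspini)

## Average silhouette width for k = 2, ..., 6 PAM clusters
ks  <- 2:6
avg <- sapply(ks, function(k)
    summary(silhouette(pam(ruspini, k)))$avg.width)
names(avg) <- ks
avg
ks[which.max(avg)]    # candidate number of clusters under this criterion

## A different summary function: clusterwise medians of s(i)
ssi.med <- summary(silhouette(pam(ruspini, 4)), FUN = median)
ssi.med$clus.avg.widths
ssi.med$clus.sizes
```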


     'sortSilhouette(sil)' orders the rows of 'sil' as in the
     silhouette plot, by cluster (increasingly) and decreasing
     silhouette width s(i).
     'attr(sil, "Ordered")' is a logical indicating if 'sil' _is_
     ordered as by 'sortSilhouette()'.  In that case, 'rownames(sil)'
     will contain case labels or numbers, and 'attr(sil, "iOrd")' the
     ordering index vector.
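
     A short sketch of how the ordering attributes relate to the
     unsorted result.  It assumes the silhouette is computed through
     the default method, so that the rows start out in the original
     data order; the names 'cl', 'sil', 'ssil' and 'iOrd' are
     illustrative.

```r
library(cluster)
data(ruspini)

## Cluster codes plus distances go through silhouette.default()
cl   <- pam(ruspini, 4)$clustering
sil  <- silhouette(cl, dist(ruspini))
ssil <- sortSilhouette(sil)

attr(ssil, "Ordered")
iOrd <- attr(ssil, "iOrd")      # ordering index vector

## The sorted rows are the original rows permuted by iOrd
all.equal(unname(ssil[, "sil_width"]), unname(sil[iOrd, "sil_width"]))
head(ssil[, c("cluster", "sil_width")])
```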

_N_o_t_e:

     While 'silhouette()' is _intrinsic_ to the 'partition'
     clusterings, and hence has a (trivial) method for these, it is
     straightforward to get silhouettes for hierarchical clusterings
     as well: pass the cluster codes from 'cutree()' together with the
     distances to 'silhouette.default()'.
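
     For instance, with a tree from 'hclust()' in package 'stats' (a
     sketch; the average linkage and k = 4 are arbitrary choices made
     here for illustration):

```r
library(cluster)
data(ruspini)

d  <- dist(ruspini)
hc <- hclust(d, method = "average")    # any hierarchical method would do

## cutree() gives integer cluster codes; 'd' supplies the dissimilarities
sil.hc <- silhouette(cutree(hc, k = 4), d)
summary(sil.hc)$clus.sizes
summary(sil.hc)$avg.width
```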

     By default, for 'clara()' partitions, the silhouette is just for
     the best random _subset_ used.  Use 'full = TRUE' to compute (and
     later possibly plot) the full silhouette.

_R_e_f_e_r_e_n_c_e_s:

     Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the
     interpretation and validation of cluster analysis. _J. Comput.
     Appl. Math._, *20*, 53-65.

     Chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the
     references in 'plot.agnes'.

_S_e_e _A_l_s_o:

     'partition.object', 'plot.partition'.

_E_x_a_m_p_l_e_s:

     data(ruspini)
     pr4 <- pam(ruspini, 4)
     str(si <- silhouette(pr4))
     (ssi <- summary(si))
     plot(si) # silhouette plot
     plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring

     si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
     summary(si2) # has small values: "canberra"'s fault
     plot(si2, nmax.lab = 80, cex.names = 0.6)

     op <- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
               mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
     for(k in 2:6)
        plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
     mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
           outer = TRUE, font = par("font.main"), cex = par("cex.main"))
     par(op)

     ## clara(): standard silhouette is just for the best random subset
     data(xclara)
     set.seed(7)
     str(xc1k <- xclara[sample(nrow(xclara), size = 1000) ,])
     cl3 <- clara(xc1k, 3)
     plot(silhouette(cl3))# only of the "best" subset of 46
     ## The full silhouette: internally needs the full dist object
     ## (about 4 MB for these 1000 rows; 36 MB for all 3000 rows of xclara):
     sf <- silhouette(cl3, full = TRUE) ## this is the same as
     s.full <- silhouette(cl3$clustering, daisy(xc1k))
     ## compare versions numerically; comparing pasted strings is unreliable
     if(getRversion() >= "2.3.0")
        stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
     ## color dependent on original "3 groups of each 1000":
     plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
          main ="plot(silhouette(clara(.), full = TRUE))")

     ## Silhouette for a hierarchical clustering:
     ar <- agnes(ruspini)
     si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                        daisy(ruspini))
     plot(si3, nmax.lab = 80, cex.names = 0.5)
     ## 2 groups: Agnes() wasn't too good:
     si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
     plot(si4, nmax.lab = 80, cex.names = 0.5)

