silhouette              package:cluster              R Documentation

_C_o_m_p_u_t_e _o_r _E_x_t_r_a_c_t _S_i_l_h_o_u_e_t_t_e _I_n_f_o_r_m_a_t_i_o_n _f_r_o_m _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Compute silhouette information according to a given clustering in
     k clusters.

_U_s_a_g_e:

     silhouette(x, ...)
     ## Default S3 method:
     silhouette(x, dist, dmatrix, ...)
     ## S3 method for class 'partition':
     silhouette(x, ...)

     sortSilhouette(object, ...)
     ## S3 method for class 'silhouette':
     summary(object, FUN = mean, ...)
     ## S3 method for class 'silhouette':
     plot(x, nmax.lab = 40, max.strlen = 5,
          main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
          col = "gray",  do.col.sort = length(col) > 1, border = 0,
          cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)

_A_r_g_u_m_e_n_t_s:

       x: an object of appropriate class; for the 'default' method an
          integer vector with k different integer cluster codes or a
          list with such an 'x$clustering' component.  Note that
          silhouette statistics are only defined if 2 <= k <= n-1.

    dist: a dissimilarity object inheriting from class 'dist' or
          coercible to one.  If not specified, 'dmatrix' must be.

 dmatrix: a symmetric dissimilarity matrix (n x n); specifying it
          instead of 'dist' can be more efficient.

  object: an object of class 'silhouette'.

     ...: further arguments passed to or from methods.

     FUN: function used to summarize silhouette widths.

nmax.lab: integer giving the number of labels above which the
          silhouette plot is considered too crowded for single-name
          labeling.

max.strlen: positive integer giving the length to which strings are
          truncated in silhouette plot labeling.

main, sub, xlab: arguments to 'title'; have a sensible non-NULL default
          here.

col, border, cex.names: arguments passed to 'barplot()'; note that the
          default used to be 'col = heat.colors(n), border = par("fg")'
          instead.
          'col' can also be a color vector of length k for clusterwise
          coloring; see also 'do.col.sort'.

do.col.sort: logical indicating if the colors 'col' should be sorted
          "along" the silhouette; this is useful for casewise or
          clusterwise coloring.

  do.n.k: logical indicating if n and k "title text" should be
          written.

do.clus.stat: logical indicating if cluster sizes and average
          silhouette widths should be written to the right of the
          silhouettes.
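
     As a minimal sketch of the 'dist' / 'dmatrix' alternatives and of
     clusterwise coloring described above: the code below passes the
     same dissimilarities once as a 'dist' object and once as a full
     matrix, then draws the plot with one color per cluster.  The
     choice of 'ruspini', of k = 4 and of the color names is an
     assumption made only for illustration.

```r
library(cluster)  # provides silhouette(), pam() and the 'ruspini' data

data(ruspini)
d  <- dist(ruspini)                 # Euclidean dissimilarities, class 'dist'
cl <- pam(ruspini, 4)$clustering    # integer cluster codes, k = 4

# Same clustering, dissimilarities given once as 'dist', once as 'dmatrix'
si_dist <- silhouette(cl, dist = d)
si_dmat <- silhouette(cl, dmatrix = as.matrix(d))
same_widths <- all.equal(si_dist[, "sil_width"], si_dmat[, "sil_width"])

# Clusterwise coloring: 'col' has length k, one (arbitrary) color per cluster
plot(si_dist, col = c("red", "green", "blue", "orange"))
```

     Supplying 'dmatrix' is mainly worthwhile when the full matrix
     already exists, since it avoids a conversion from 'dist'.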

_D_e_t_a_i_l_s:

     For each observation i, the _silhouette width_ s(i) is defined as
     follows:
     Put a(i) = average dissimilarity between i and all other points
     of the cluster to which i belongs.  For all _other_ clusters C,
     put d(i,C) = average dissimilarity of i to all observations of C.
     The smallest of these d(i,C) is b(i) := min_C d(i,C), and can be
     seen as the dissimilarity between i and its "neighbor" cluster,
     i.e., the nearest one to which it does _not_ belong.  Finally,

             s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).


     Observations with a large s(i) (close to 1) are very well
     clustered; a small s(i) (around 0) means that the observation
     lies between two clusters; and observations with a negative s(i)
     are probably placed in the wrong cluster.
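
     The definition above can be traced by hand on a tiny example.  The
     sketch below uses base R only; the five points, the clustering and
     the helper name 'sil_width' are made up for illustration and are
     not part of the package.  The rule s(i) = 0 for a cluster with a
     single member is an assumption taken from Rousseeuw (1987), not
     from the text above.

```r
# Toy data (hypothetical): five points on a line in two clusters
x  <- c(1, 2, 3, 10, 11)
cl <- c(1L, 1L, 1L, 2L, 2L)
d  <- as.matrix(dist(x))            # full n x n dissimilarity matrix

# Hypothetical helper: silhouette width s(i) of observation i
sil_width <- function(i, d, cl) {
  own <- cl == cl[i]
  if (sum(own) == 1L) return(0)     # assumed convention for singletons
  # a(i): mean dissimilarity to the other members of i's own cluster
  # (d[i, i] is 0, so it adds nothing to the sum)
  a <- sum(d[i, own]) / (sum(own) - 1)
  # b(i): smallest mean dissimilarity to any other cluster
  b <- min(tapply(d[i, !own], cl[!own], mean))
  (b - a) / max(a, b)
}

s <- vapply(seq_along(cl), sil_width, numeric(1), d = d, cl = cl)
round(s, 3)   # 0.842 0.882 0.800 0.875 0.889
```

     For the third point, a(3) = (2 + 1)/2 = 1.5 and b(3) = (7 + 8)/2 =
     7.5, so s(3) = 6/7.5 = 0.8.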

_V_a_l_u_e:

     'silhouette()' returns an object, 'sil', of class 'silhouette'
     which is an [n x 3] matrix with attributes.  For each observation
     i, 'sil[i,]' contains the cluster to which i belongs as well as
     the neighbor cluster of i (the cluster, not containing i, for
     which the average dissimilarity between its observations and i is
     minimal), and the silhouette width s(i) of the observation.  The
     'colnames' correspondingly are 'c("cluster", "neighbor",
     "sil_width")'.

     'summary(sil)' returns an object of class 'summary.silhouette', a
     list with components 

si.summary: numerical 'summary' of the individual silhouette widths
          s(i).

clus.avg.widths: numeric (rank 1) array of clusterwise _means_ of
          silhouette widths, where each "mean" is computed by 'FUN'.

avg.width: the total mean 'FUN(s)' where 's' are the individual
          silhouette widths.

clus.sizes: 'table' of the k cluster sizes.

    call: if available, the call creating 'sil'.

 Ordered: logical identical to 'attr(sil, "Ordered")', see below.
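
     A short sketch of how the returned matrix and the summary
     components listed above might be inspected.  The data set and
     k = 4 are assumptions chosen for illustration; the variable names
     are arbitrary.

```r
library(cluster)

data(ruspini)
cl  <- pam(ruspini, 4)$clustering
sil <- silhouette(cl, dist(ruspini))

# The silhouette object: an n x 3 matrix, one row per observation
colnames(sil)                       # "cluster" "neighbor" "sil_width"
sil[1:3, ]
doubtful <- which(sil[, "sil_width"] < 0)   # candidates for a wrong cluster

# Summary components
ssi <- summary(sil)                 # FUN = mean by default
ssi$avg.width                       # FUN applied to all silhouette widths
ssi$clus.avg.widths                 # FUN applied within each cluster
ssi$clus.sizes                      # table of the k cluster sizes

ssi_med <- summary(sil, FUN = median)   # medians instead of means
```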


     'sortSilhouette(sil)' orders the rows of 'sil' as in the
     silhouette plot, by cluster (increasingly) and decreasing
     silhouette width s(i).
     'attr(sil, "Ordered")' is a logical indicating if 'sil' _is_
     ordered as by 'sortSilhouette()'.  In that case, 'rownames(sil)'
     will contain case labels or numbers, and 'attr(sil, "iOrd")' the
     ordering index vector.
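
     The ordering and its attributes can be checked along these lines.
     This is a sketch only: the data set and k = 4 are assumptions made
     for illustration.

```r
library(cluster)

data(ruspini)
sil <- silhouette(pam(ruspini, 4)$clustering, dist(ruspini))
attr(sil, "Ordered")                # ordering flag before sorting

ssil <- sortSilhouette(sil)
attr(ssil, "Ordered")               # ordering flag after sorting
iOrd <- attr(ssil, "iOrd")          # row i of 'ssil' is row iOrd[i] of 'sil'
ssil[1:6, ]                         # first cluster, widest silhouettes first
```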

_N_o_t_e:

     While 'silhouette()' is _intrinsic_ to the 'partition'
     clusterings, and hence has a (trivial) method for these, it is
     straightforward to get silhouettes for hierarchical clusterings
     via 'silhouette.default()', with the result of 'cutree()' and a
     distance as input.
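
     A minimal sketch of this route using 'hclust()' from package
     'stats' instead of a 'cluster' function.  The linkage method and
     k = 4 are arbitrary choices made for illustration.

```r
library(cluster)

data(ruspini)
d  <- dist(ruspini)
hc <- hclust(d, method = "average")   # linkage chosen arbitrarily
cl <- cutree(hc, k = 4)               # integer cluster codes

si_hc <- silhouette(cl, d)            # dispatches to the default method
summary(si_hc)$avg.width
```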

_R_e_f_e_r_e_n_c_e_s:

     Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the
     interpretation and validation of cluster analysis. _J. Comput.
     Appl. Math._, *20*, 53-65.

     Chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the
     references in 'plot.agnes'.

_S_e_e _A_l_s_o:

     'partition.object', 'plot.partition'.

_E_x_a_m_p_l_e_s:

      library(cluster)
      data(ruspini)
      pr4 <- pam(ruspini, 4)
      str(si <- silhouette(pr4))
      (ssi <- summary(si))
      plot(si) # silhouette plot

      si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
      summary(si2) # has small values: "canberra"'s fault
      plot(si2, nmax.lab = 80, cex.names = 0.6)

      par(mfrow = c(3,2), oma = c(0,0, 3, 0))
      for(k in 2:6)
         plot(silhouette(pam(ruspini, k = k)), main = paste("k =", k), do.n.k = FALSE)
      mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
            outer = TRUE, font = par("font.main"), cex = par("cex.main"))

      ## Silhouette for a hierarchical clustering:
      ar <- agnes(ruspini)
      si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                        daisy(ruspini))
      plot(si3, nmax.lab = 80, cex.names = 0.5)
      ## 2 groups: agnes() wasn't too good:
      si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
      plot(si4, nmax.lab = 80, cex.names = 0.5)

