kmeans                 package:stats                 R Documentation

_K-_M_e_a_n_s _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Perform k-means clustering on a data matrix.

_U_s_a_g_e:

     kmeans(x, centers, iter.max = 10, nstart = 1,
            algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                          "MacQueen"))

_A_r_g_u_m_e_n_t_s:

       x: A numeric matrix of data, or an object that can be coerced to
          such a matrix (such as a numeric vector or a data frame with
          all numeric columns).

 centers: Either the number of clusters or a set of initial (distinct)
          cluster centres.  If a number, a random set of (distinct)
          rows in 'x' is chosen as the initial centres.

iter.max: The maximum number of iterations allowed.

  nstart: If 'centers' is a number, the number of random sets of
          initial centres to try.

algorithm: A character string naming the algorithm to use (one of the
          values shown in Usage); may be abbreviated.

_D_e_t_a_i_l_s:

     The data given by 'x' is clustered by the k-means method, which
     aims to partition the points into k groups such that the sum of
     squares from points to the assigned cluster centres is minimized.
     At the minimum, all cluster centres are at the mean of their
     Voronoi sets (the set of data points which are nearest to the
     cluster centre).
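
     The two properties above can be checked numerically. The
     following is a minimal sketch, not part of the original page: the
     seed and the helper names 'grp_means' and 'grp_ss' are
     assumptions made for illustration, and the data mirror those in
     the Examples section.

```r
set.seed(1)  # arbitrary seed (an assumption), only for repeatability
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)
k <- nrow(cl$centers)

## Mean of the points assigned to each cluster, one row per cluster
grp_means <- t(sapply(seq_len(k), function(j)
  colMeans(x[cl$cluster == j, , drop = FALSE])))

## Sum of squared distances from each point to its assigned centre
grp_ss <- sapply(seq_len(k), function(j) {
  pts <- x[cl$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, cl$centers[j, ])^2)
})

## Compare with what kmeans() reports (up to numerical tolerance)
centres_ok <- isTRUE(all.equal(unname(cl$centers), unname(grp_means)))
ss_ok <- isTRUE(all.equal(as.vector(cl$withinss), grp_ss))
```

     If the fit has converged, both 'centres_ok' and 'ss_ok' would be
     expected to be 'TRUE'.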

     The algorithm of Hartigan and Wong (1979) is used by default.
     Note that some authors use k-means to refer to a specific
     algorithm rather than the general method: most commonly the
     algorithm given by MacQueen (1967), but sometimes that given by
     Lloyd (1957) and Forgy (1965). The Hartigan-Wong algorithm
     generally reaches a lower sum of squares than either of those,
     but trying several random starts ('nstart' > 1) is still often
     recommended.
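
     As a rough illustration of the choice of algorithm and of random
     starts, the sketch below fits the same data with each method and
     sums 'withinss' to get a total. The seed, the helper 'total_ss'
     and the value of 'iter.max' are assumptions made here. Each call
     draws its own random initial centres, so a single run is only
     indicative.

```r
set.seed(42)  # arbitrary seed (an assumption), only for repeatability
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

## Hypothetical helper: total within-cluster sum of squares of a fit
total_ss <- function(fit) sum(fit$withinss)

## One random start per algorithm
algs <- c("Hartigan-Wong", "Lloyd", "MacQueen")
one_start <- sapply(algs, function(a)
  total_ss(kmeans(x, 5, iter.max = 50, algorithm = a)))

## Default algorithm with 25 random starts
many_starts <- total_ss(kmeans(x, 5, iter.max = 50, nstart = 25))

print(one_start)
print(many_starts)
```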

     Except for the Lloyd-Forgy method, k clusters will always be
     returned if 'centers' is given as a number. If an initial matrix
     of centres is supplied, it is possible that no point will be
     closest to one or more centres, which is currently an error for
     the Hartigan-Wong method.
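
     Supplying initial centres as a matrix might look like the sketch
     below. The seed, the rows picked for 'init' and the deliberately
     remote third centre in 'far' are assumptions chosen for
     illustration. The second call is wrapped in 'tryCatch' because
     its outcome may depend on the algorithm and the version of R.

```r
set.seed(7)  # arbitrary seed (an assumption), only for repeatability
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

## Two distinct rows of x, one from each simulated group
init <- x[c(1, 51), ]
cl <- kmeans(x, centers = init)

## A third centre far from every point: no point is closest to it
far <- rbind(c(0, 0), c(1, 1), c(100, 100))
res <- tryCatch(kmeans(x, centers = far),
                error = function(e) conditionMessage(e))
## 'res' holds the error message if the call failed, else the fit
```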

_V_a_l_u_e:

     An object of class '"kmeans"' which is a list with components:

 cluster: A vector of integers indicating the cluster to which each
          point is allocated. 

 centers: A matrix of cluster centres.

withinss: The within-cluster sum of squares for each cluster.

    size: The number of points in each cluster.


     There is a 'print' method for this class.
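
     The components can be read off the returned list directly. A
     short sketch follows, with an arbitrary seed and the name
     'counts' introduced here for illustration.

```r
set.seed(3)  # arbitrary seed (an assumption), only for repeatability
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

cl$cluster    # integer cluster label for each row of x
cl$centers    # one row per cluster centre
cl$withinss   # one within-cluster sum of squares per cluster
cl$size       # number of points in each cluster

## Counting the labels by hand should reproduce cl$size
counts <- tabulate(cl$cluster, nbins = nrow(cl$centers))

print(cl)     # uses the 'print' method for class "kmeans"
```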

_R_e_f_e_r_e_n_c_e_s:

     Forgy, E. W. (1965). Cluster analysis of multivariate data:
     efficiency vs interpretability of classifications. _Biometrics_
     *21*, 768-769.

     Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering
     algorithm. _Applied Statistics_ *28*, 100-108.

     Lloyd, S. P. (1957, 1982). Least squares quantization in PCM.
     Technical Note, Bell Laboratories. Published in 1982 in _IEEE
     Transactions on Information Theory_ *28*, 128-137.

     MacQueen, J. (1967). Some methods for classification and analysis
     of multivariate observations. In _Proceedings of the Fifth
     Berkeley Symposium on Mathematical Statistics and Probability_,
     eds L. M. Le Cam & J. Neyman, *1*, pp. 281-297. Berkeley, CA:
     University of California Press.

_E_x_a_m_p_l_e_s:

     require(graphics)

     ## a 2-dimensional example: two Gaussian groups of 50 points each,
     ## centred at (0, 0) and (1, 1)
     x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
     colnames(x) <- c("x", "y")
     (cl <- kmeans(x, 2))          # outer parentheses print the fit
     plot(x, col = cl$cluster)     # colour points by assigned cluster
     points(cl$centers, col = 1:2, pch = 8, cex = 2)  # mark the centres

     ## with more clusters than groups, random starts do help
     (cl <- kmeans(x, 5, nstart = 25))
     plot(x, col = cl$cluster)
     points(cl$centers, col = 1:5, pch = 8)

