cor                  package:stats                  R Documentation

_C_o_r_r_e_l_a_t_i_o_n, _V_a_r_i_a_n_c_e _a_n_d _C_o_v_a_r_i_a_n_c_e (_M_a_t_r_i_c_e_s)

_D_e_s_c_r_i_p_t_i_o_n:

     'var', 'cov' and 'cor' compute the variance of 'x' and the
     covariance or correlation of 'x' and 'y' if these are vectors.  If
     'x' and 'y' are matrices then the covariances (or correlations)
     between the columns of 'x' and the columns of 'y' are computed.

     'cov2cor' scales a covariance matrix into the corresponding
     correlation matrix _efficiently_.

_U_s_a_g_e:

     var(x, y = NULL, na.rm = FALSE, use)

     cov(x, y = NULL, use = "all.obs",
         method = c("pearson", "kendall", "spearman"))

     cor(x, y = NULL, use = "all.obs",
          method = c("pearson", "kendall", "spearman"))

     cov2cor(V)

_A_r_g_u_m_e_n_t_s:

       x: a numeric vector, matrix or data frame.

       y: 'NULL' (default) or a vector, matrix or data frame with
          compatible dimensions to 'x'.  The default is equivalent to
          'y = x' (but more efficient).

   na.rm: logical. Should missing values be removed?

     use: an optional character string giving a method for computing
          covariances in the presence of missing values.  This must be
          (an abbreviation of) one of the strings '"all.obs"',
          '"complete.obs"' or '"pairwise.complete.obs"'.

  method: a character string indicating which correlation coefficient
          (or covariance) is to be computed.  One of '"pearson"'
          (default), '"kendall"', or '"spearman"'; the string can be
          abbreviated.

       V: symmetric numeric matrix, usually positive definite such as a
          covariance matrix.

_D_e_t_a_i_l_s:

     For 'cov' and 'cor' one must _either_ give a matrix or data frame
     for 'x' _or_ give both 'x' and 'y'.
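
     As a minimal sketch of the two calling conventions (the matrices
     'x' and 'y' below are made-up illustration data, not taken from
     this page): a single matrix yields results among its own columns,
     while two matrices yield one entry per pair of columns.

```r
# Hypothetical illustration data: 4 observations of 2 and 3 variables
x <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 1, 3))
y <- cbind(u = c(1, 3, 2, 5), v = c(4, 3, 2, 1), w = c(1, 1, 2, 2))

r_xx <- cor(x)      # 2 x 2: columns of x against each other
c_xy <- cov(x, y)   # 2 x 3: columns of x against columns of y

dim(r_xx)
dim(c_xy)

# A bare vector without 'y' is rejected
res <- try(cor(c(1, 2, 3, 4)), silent = TRUE)
inherits(res, "try-error")
```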

     'var' is just another interface to 'cov', where 'na.rm' is used to
     determine the default for 'use' when that is unspecified.  If
     'na.rm' is 'TRUE' then the complete observations (rows) are used
     ('use = "complete"') to compute the variance.  Otherwise ('use =
     "all"'), 'var' will give an error if there are missing values.

     If 'use' is '"all.obs"', then the presence of missing observations
     will produce an error. If 'use' is '"complete.obs"' then missing
     values are handled by casewise deletion.  Finally, if 'use' has
     the value '"pairwise.complete.obs"' then the correlation between
     each pair of variables is computed using all complete pairs of
     observations on those variables. This can result in covariance or
     correlation matrices which are not positive semidefinite.
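
     The three settings of 'use' can be compared on a small example.
     This is only a sketch: the data frame 'df' is hypothetical, and
     the checks simply recompute each result by hand from the rows the
     description above says should be used.

```r
# Hypothetical data frame with two missing values
df <- data.frame(a = c(1, 2, 3, 4, NA),
                 b = c(2, 1, 4, 3, 5),
                 c = c(5, NA, 3, 2, 1))

# "all.obs": any missing value is an error
res <- try(cov(df, use = "all.obs"), silent = TRUE)
inherits(res, "try-error")

# "complete.obs": drop every row containing an NA (casewise deletion)
cc <- cov(df, use = "complete.obs")
all.equal(cc, cov(df[complete.cases(df), ]))

# "pairwise.complete.obs": each entry uses the rows complete for that pair
pw <- cov(df, use = "pairwise.complete.obs")
ok <- complete.cases(df$a, df$b)
all.equal(pw["a", "b"], cov(df$a[ok], df$b[ok]))

# var() with na.rm = TRUE likewise uses only the non-missing values
all.equal(var(df$a, na.rm = TRUE), var(df$a[!is.na(df$a)]))
```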

     The denominator n - 1 is used, which gives an unbiased estimator
     of the (co)variance for i.i.d. observations.  These functions
     return 'NA' when there is only one observation (whereas S-PLUS has
     been returning 'NaN'), and fail if 'x' has length zero.
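
     To make the n - 1 denominator concrete, the following sketch
     recomputes the variance and covariance by hand; the vectors 'x'
     and 'y' are hypothetical sample data.

```r
x <- c(2, 4, 4, 5, 7)   # hypothetical sample
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Sums of squares and cross-products, divided by n - 1
v_manual <- sum((x - mean(x))^2) / (n - 1)
c_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(v_manual, var(x))
all.equal(c_manual, cov(x, y))

# A single observation leaves n - 1 = 0 degrees of freedom
is.na(var(3))
```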

     For 'cor()', if 'method' is '"kendall"' or '"spearman"', Kendall's
     tau or Spearman's rho statistic is used to estimate a rank-based
     measure of association.  These are more robust and have been
     recommended if the data do not necessarily come from a bivariate
     normal distribution.

     For 'cov()', a non-Pearson method is unusual but available for
     the sake of completeness.  Note that '"spearman"' basically
     computes 'cor(R(x), R(y))' (or 'cov(.,.)') where 'R(u) := rank(u,
     na.last="keep")'.  Notice also that the ranking is (currently) done
     removing only cases that are missing on the variable itself, which
     may not be what you expect if you let 'use' be '"complete.obs"' or
     '"pairwise.complete.obs"'.
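
     The relation between '"spearman"' and ranks can be checked
     directly.  A small sketch on hypothetical vectors without missing
     values (so the caveat about 'use' does not arise):

```r
x <- c(10, 20, 20, 35, 50, 70)   # hypothetical data, including a tie
y <- c(3, 1, 4, 4, 9, 8)

rho       <- cor(x, y, method = "spearman")
rho_ranks <- cor(rank(x), rank(y))   # Pearson correlation of the ranks
all.equal(rho, rho_ranks)

# The same holds for the (unusual) rank-based covariance
all.equal(cov(x, y, method = "spearman"), cov(rank(x), rank(y)))
```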

     A covariance matrix can be scaled into a correlation matrix in
     many ways: the mathematically most appealing is multiplication by
     a diagonal matrix from the left and the right; a more efficient
     one is applying 'sweep(.., FUN = "/")' twice.  The 'cov2cor'
     function is slightly more efficient still, and is provided mostly
     for didactic reasons.
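
     The three routes can be sketched side by side.  The matrix 'm' is
     hypothetical data, and the names 's', 'R1', 'R2' and 'R3' are
     chosen for this illustration only.

```r
# Hypothetical data: 5 observations of 3 variables
m <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 6), c = c(5, 3, 4, 1, 2))
V <- cov(m)
s <- sqrt(diag(V))   # standard deviations

# (1) D V D with D = diag(1 / s)
R1 <- diag(1 / s) %*% V %*% diag(1 / s)

# (2) divide the rows, then the columns, by the standard deviations
R2 <- sweep(sweep(V, 1, s, FUN = "/"), 2, s, FUN = "/")

# (3) the dedicated function
R3 <- cov2cor(V)

all.equal(R1, R3, check.attributes = FALSE)   # ignore dimnames
all.equal(R2, R3)
all.equal(R3, cor(m))
```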

_V_a_l_u_e:

     For 'r <- cor(*, use = "all.obs")', it is now guaranteed that
     'all(r <= 1)'.

_R_e_f_e_r_e_n_c_e_s:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_. Wadsworth & Brooks/Cole.

_S_e_e _A_l_s_o:

     'cor.test' for confidence intervals (and tests).

     'cov.wt' for _weighted_ covariance computation.

     'sd' for standard deviation (vectors).

_E_x_a_m_p_l_e_s:

     var(1:10)  # 9.166667

     var(1:5, 1:5)  # 2.5

     ## Two simple vectors
     cor(1:10, 2:11)  # == 1

     ## Correlation Matrix of Multivariate sample:
     data(longley)
     (Cl <- cor(longley))
     ## Graphical Correlation Matrix:
     symnum(Cl) # highly correlated

     ## Spearman's rho  and  Kendall's tau
     symnum(clS <- cor(longley, method = "spearman"))
     symnum(clK <- cor(longley, method = "kendall"))
     ## How much do they differ?
     i <- lower.tri(Cl)
     cor(cbind(P = Cl[i], S = clS[i], K = clK[i]))

     ## cov2cor() scales a covariance matrix by its diagonal
     ##           to become the correlation matrix.
     cov2cor # see the function definition {and learn ..}
     stopifnot(all.equal(Cl, cov2cor(cov(longley))),
               all.equal(cor(longley, method="kendall"),
                 cov2cor(cov(longley, method="kendall"))))

     ##--- Missing value treatment:
     data(swiss)
     C1 <- cov(swiss)
     range(eigen(C1, only.values = TRUE)$values) # 6.19  1921
     swM <- swiss
     swM[1,2] <- swM[7,3] <- swM[25,5] <- NA # create 3 "missing"
     try(cov(swM)) # Error: missing obs...
     C2 <- cov(swM, use = "complete")
     range(eigen(C2, only.values = TRUE)$values) # 6.46  1930
     C3 <- cov(swM, use = "pairwise")
     range(eigen(C3, only.values = TRUE)$values) # 6.19  1938

     (scM <- symnum(cor(swM, method = "kendall", use = "complete")))
     ## Kendall's tau doesn't change much: identical symnum codings!
     identical(scM, symnum(cor(swiss, method = "kendall")))

     all.equal(cov2cor(cov(swM, method = "kendall", use = "pairwise")),
                       cor(swM, method = "kendall", use = "pairwise"))

