How to calculate correlation of two variables in a huge data set in R?

Developer https://www.devze.com 2023-04-07 05:37 Source: Internet
I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I simply tried to find the correlation between columns A and B:

cor(A, B)

and I got

[1] NA

as a result. What can I do to fix this problem?


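Try cor(A, B, use = "pairwise.complete.obs"). This computes the correlation from only those rows where both A and B are non-missing, so the NAs in your observations no longer turn the result into NA.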
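As a minimal sketch of what that option does, the snippet below uses two short made-up vectors (hypothetical stand-ins for columns A and B, not the asker's data) and compares the default call, the pairwise call, and the same filtering done by hand.

```r
# Toy stand-ins for columns A and B (hypothetical data).
A <- c(1.2, 2.4, NA, 4.1, 5.3, 6.8)
B <- c(2.0, NA, 3.9, 8.1, 10.2, 13.5)

# The default is use = "everything": any NA in the inputs gives NA.
r_default <- cor(A, B)

# Keep only the rows where both A and B are observed.
r_pairwise <- cor(A, B, use = "pairwise.complete.obs")

# The same filtering done by hand, to show what the option amounts to.
ok <- !is.na(A) & !is.na(B)
r_manual <- cor(A[ok], B[ok])

print(r_default)
print(r_pairwise)
print(all.equal(r_pairwise, r_manual))
```

With only two vectors, "pairwise.complete.obs" and "complete.obs" select the same rows; the two options differ only when a matrix or data frame with more than two columns is passed.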

To be statistically rigorous, you should also check how many entries are missing in your data and whether the missing-at-random assumption plausibly holds. If missingness depends on the values themselves, dropping those rows can bias the correlation.
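One way to look at the missing entries is sketched below. The data frame df is simulated here purely for illustration (the real data is not shown in the question), and the last step is only a crude descriptive check, not a formal test of the missing-at-random assumption.

```r
# Simulated stand-in for the real data frame (an assumption for illustration).
set.seed(1)
df <- data.frame(A = rnorm(1000), B = rnorm(1000))
df$A[sample(1000, 50)] <- NA   # knock out 50 distinct values of A
df$B[sample(1000, 30)] <- NA   # knock out 30 distinct values of B

# Number of missing entries per column.
na_per_column <- colSums(is.na(df))

# Rows that survive when both A and B must be present.
n_complete <- sum(complete.cases(df[, c("A", "B")]))
frac_dropped <- 1 - n_complete / nrow(df)

# Crude check: does B look different in rows where A is missing?
# A large gap would hint that the missingness is not random.
b_by_a_missing <- tapply(df$B, is.na(df$A), mean, na.rm = TRUE)

print(na_per_column)
print(n_complete)
print(frac_dropped)
print(b_by_a_missing)
```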

Edit 1: Take a look at ?cor to see other options for the use parameter.
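To illustrate how some of those options differ, here is a small sketch on a made-up three-column data frame d (hypothetical values). With more than two columns, "complete.obs" and "pairwise.complete.obs" can use different rows for the same pair, and "all.obs" raises an error when any NA is present.

```r
# Made-up data with one NA in each column, each in a different row.
d <- data.frame(
  A = c(1, 2, 3, 4, 5, NA),
  B = c(2, 1, 4, 3, NA, 6),
  C = c(NA, 2, 3, 5, 4, 6)
)

# "complete.obs": drop every row with an NA in any column, then correlate.
r_complete <- cor(d, use = "complete.obs")

# "pairwise.complete.obs": for each pair of columns, drop only the rows
# that are missing in that particular pair.
r_pairwise <- cor(d, use = "pairwise.complete.obs")

# "all.obs": any NA is treated as an error, caught here for display.
r_all <- tryCatch(cor(d, use = "all.obs"), error = function(e) "error")

print(r_complete["A", "B"])   # based on rows 2 to 4 only
print(r_pairwise["A", "B"])   # based on rows 1 to 4
print(r_all)
```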


You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients
  2. the number of observations used for each correlation
  3. the p-value for each correlation
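A minimal sketch of reading those three matrices follows. It assumes the Hmisc package is installed (the call is skipped otherwise) and uses simulated columns A and B rather than the asker's data; the base R values are computed alongside for comparison.

```r
# Simulated stand-ins for columns A and B, with a few NAs (hypothetical data).
set.seed(42)
n <- 200
A <- rnorm(n)
B <- 0.5 * A + rnorm(n)
A[c(3, 50)] <- NA
B[c(7, 50, 120)] <- NA
m <- cbind(A = A, B = B)   # rcorr expects a numeric matrix

# Base R reference values for comparison.
r_base <- cor(A, B, use = "pairwise.complete.obs")
n_base <- sum(complete.cases(m))

if (requireNamespace("Hmisc", quietly = TRUE)) {
  res <- Hmisc::rcorr(m, type = "pearson")
  print(res$r["A", "B"])   # correlation coefficient
  print(res$n["A", "B"])   # number of observations used for this pair
  print(res$P["A", "B"])   # p-value for this pair
}

print(r_base)
print(n_base)
```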

Some example code is available here:
