I've got a huge data set with six columns (call them A, B, C, D, E, F) and about 450,000 rows. I simply tried to find the correlation between columns A and B:
cor(A, B)
and I got
[1] NA
as a result. What can I do to fix this problem?
Try cor(A, B, use = "pairwise.complete.obs"). That will skip every observation where A or B is NA instead of letting a single NA turn the whole result into NA.
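A minimal sketch of the difference, using made-up vectors in place of the real columns (the data, the seed and the positions of the NAs are assumptions for illustration only):

```r
# Hypothetical stand-in data; the real A and B come from the asker's data set.
set.seed(1)
A <- rnorm(1000)
B <- 0.5 * A + rnorm(1000)
A[c(3, 50)] <- NA   # inject a few missing values
B[c(7, 50)] <- NA

r_default <- cor(A, B)   # NA, because the default is use = "everything"
r_pairwise <- cor(A, B, use = "pairwise.complete.obs")

# The same result by hand: keep only the rows where both values are present.
ok <- complete.cases(A, B)
r_manual <- cor(A[ok], B[ok])

print(r_default)
print(all.equal(r_pairwise, r_manual))
```

For just two vectors, "pairwise.complete.obs" and "complete.obs" keep the same rows; they only differ once more than two columns are involved.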
To be statistically rigorous, you should also check how many entries are missing in your data and whether the missing-at-random assumption holds.
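One way to start on that, sketched on a hypothetical data frame df (the column names, sizes and missingness pattern are invented here). The last step is only a crude diagnostic, not a formal test of the assumption:

```r
# Hypothetical data frame with some values knocked out at random.
set.seed(2)
df <- data.frame(A = rnorm(200), B = rnorm(200))
df$A[sample(200, 10)] <- NA
df$B[sample(200, 20)] <- NA

n_missing <- colSums(is.na(df))        # missing entries per column
frac_missing <- colMeans(is.na(df))    # the same, as a fraction of rows
n_complete <- sum(complete.cases(df))  # rows usable for cor(A, B)

print(n_missing)
print(frac_missing)
print(n_complete)

# Crude check: does B look different in the rows where A is missing?
# A large gap between the two group means would suggest that the
# missingness in A is related to the data rather than random.
print(tapply(df$B, is.na(df$A), mean, na.rm = TRUE))
```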
Edit 1: Take a look at ?cor to see other options for the use parameter.
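As a rough illustration of how those options behave, here is a sketch on a small made-up matrix m (the three columns and the placement of the NAs are assumptions). The option names are the ones documented in ?cor:

```r
# Hypothetical matrix: A is missing in rows 1-5, C in rows 6-10.
set.seed(3)
m <- cbind(A = rnorm(100), B = rnorm(100), C = rnorm(100))
m[1:5, "A"] <- NA
m[6:10, "C"] <- NA

c_everything <- cor(m, use = "everything")            # NA for pairs touching an NA
c_complete   <- cor(m, use = "complete.obs")          # drops rows 1-10 for every pair
c_pairwise   <- cor(m, use = "pairwise.complete.obs") # drops rows pair by pair

# use = "all.obs" refuses to run when any value is missing.
res_all_obs <- try(cor(m, use = "all.obs"), silent = TRUE)

print(c_complete)
print(c_pairwise)
print(inherits(res_all_obs, "try-error"))
```

With "complete.obs" the A-B correlation loses rows 6-10 even though A and B are both present there; with "pairwise.complete.obs" it only loses rows 1-5.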
You might consider using the rcorr function in the Hmisc package.
It is very fast, and only includes pairwise complete observations. The returned object contains three matrices:
- the correlation scores
- the number of observations used for each correlation value
- a p-value for each correlation
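A short sketch of what that looks like, assuming the Hmisc package is installed (the matrix m is invented for illustration; rcorr expects a numeric matrix, so a data frame would need as.matrix() first):

```r
# Runs only if Hmisc is available; the data below are hypothetical.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  set.seed(4)
  m <- cbind(A = rnorm(100), B = rnorm(100), C = rnorm(100))
  m[1:5, "A"] <- NA

  res <- Hmisc::rcorr(m, type = "pearson")

  print(res$r)  # correlation matrix
  print(res$n)  # number of observations used for each pair
  print(res$P)  # p-value for each pair
}
```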
Some example code is available here: