Here's a problem I'm encountering:
Example Data
df <- data.frame(1,2,3,4,5,6,7,8)
df <- rbind(df,df,df,df)
What I would like to do is find the p.value for the chisq.test of 1,2,3 vs. 4,5,6 in the data.frame defined above in the first row.
Let's try it flat out:
chisq.test(c(1,2,3),c(4,5,6))$p.value ## this works.
But when I try to do it by calling the columns/rows...
chisq.test(df[1,1:3],df[1,4:6])$p.value
Gives: Error in complete.cases(x, y) : not all arguments have the same length
Interesting, because that doesn't seem to be true:
length(df[1,1:3])
length(df[1,4:6])
Any thoughts on how to change the notation to get the desired result?
?chisq.test tells us:
Arguments:
x: a numeric vector or matrix. ‘x’ and ‘y’ can also both be
factors.
y: a numeric vector; ignored if ‘x’ is a matrix. If ‘x’ is a
factor, ‘y’ should be a factor of the same length.
If we look at df as per your question, the subsets you define are:
> is.numeric(df[1,1:3])
[1] FALSE
> is.vector(df[1,1:3])
[1] FALSE
> is.matrix(df[1,1:3])
[1] FALSE
and the same for your other subset. What happens then is in the lap of the gods. Internally, because df[1,1:3] is a data frame, it is converted first to a one-row matrix, and thence to a vector:
Browse[2]> x ## here x is df[1,1:3]
[1] 1 2 3
whilst df[1,4:6] (y in the chisq.test function) is left untouched:
Browse[2]> y
X4 X5 X6
1 4 5 6
and when the code calls complete.cases(x,y), we get the error you report:
Browse[2]> complete.cases(x, y)
Error in complete.cases(x, y) : not all arguments have the same length
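complete.cases calls internal code so we can't see what is going on, but essentially R thinks x and y are not of the same length, and this is because they are of different types: x is now a vector of three elements, whereas y is still a data frame with a single row, and complete.cases counts rows for a data frame.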
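To make the mismatch concrete, here is a rough sketch that mimics the conversion by hand rather than stepping through chisq.test itself. The names x, y, xm, xv and res are just for illustration, and the exact internal steps may differ between R versions.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)

x <- df[1, 1:3]
y <- df[1, 4:6]

# length() of a data frame is its number of columns, so both report 3
length(x)
length(y)

# Roughly what happens to x: data frame -> one-row matrix -> plain vector
xm <- as.matrix(x)
dim(xm)
xv <- as.vector(xm)

# x now contributes 3 cases (elements), y only 1 case (row)
length(xv)
nrow(y)

# Capture the error message instead of stopping the script
res <- tryCatch(complete.cases(xv, y),
                error = function(e) conditionMessage(e))
res
```

So the two length() calls in the question agree only because they count columns, not cases.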
@Prasad provides a workaround, namely unlisting the two data frames you supply to chisq.test into vectors.
However, the way you are using the function doesn't make much sense, to me at least. One would normally store the data in columns, rather than rows, of a data frame. It might not look as if there is a difference, but the columns of the data frame are its components, like the components of a list. Each individual component (column) is a discrete entity, a vector of data on the /n/ observations in the data frame. If we transpose your df (and cast back to a data frame) to reflect a more natural data set-up:
> df2 <- data.frame(t(df))
then we can use the approach you did, but index the separate rows of the first column of df2 (rather than the separate columns of the first row of df) in the chisq.test call:
> chisq.test(df2[1:3,1], df2[4:6,1])
Pearson's Chi-squared test
data: df2[1:3, 1] and df2[4:6, 1]
X-squared = 6, df = 4, p-value = 0.1991
Warning message:
In chisq.test(df2[1:3, 1], df2[4:6, 1]) :
Chi-squared approximation may be incorrect
This works, because R is able to drop the empty dimension in both subsets, so both inputs are vectors of the appropriate length:
> df2[1:3,1] ## drops the empty dimension!
[1] 1 2 3
> is.vector(df2[1:3,1])
[1] TRUE
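A small sketch of the dropping behaviour, for comparison. The object names col_default, col_kept and row_default are made up for this example; drop = FALSE is the standard argument to `[` that suppresses the simplification.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)
df2 <- data.frame(t(df))

col_default <- df2[1:3, 1]                # one column: simplified to a vector
col_kept    <- df2[1:3, 1, drop = FALSE]  # same cells, kept as a data frame
row_default <- df[1, 1:3]                 # several columns: stays a data frame

is.vector(col_default)
is.data.frame(col_kept)
is.data.frame(row_default)
```

Selecting a single column can be simplified to that column's vector, whereas a single row spanning several columns cannot, because each column is a separate component.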
Use unlist when you are extracting the rows from the data frame:
> chisq.test(unlist(df[1,1:3]),unlist(df[1,4:6]))$p.value
[1] 0.1991483
Warning message:
In chisq.test(unlist(df[1, 1:3]), unlist(df[1, 4:6])) :
Chi-squared approximation may be incorrect
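If the same p-value is wanted for every row rather than just the first, one possible sketch is to go through apply, which coerces the data frame to a matrix and hands each row over as a plain vector, so no unlist is needed. The helper name row_p_value is hypothetical, and the warnings are suppressed here only to keep the output short, not because they can be ignored.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)

# Hypothetical helper: p-value for elements 1:3 vs 4:6 of a single row
row_p_value <- function(row) {
  suppressWarnings(chisq.test(row[1:3], row[4:6])$p.value)
}

# apply() over rows (MARGIN = 1); each row arrives as a numeric vector
p_values <- apply(df, 1, row_p_value)
p_values
```

Every row of the example data is identical, so all four values should match the single-row result above.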