chisq.test Error Message

devze.com (https://www.devze.com), 2023-02-05 20:01, source: the web

Here's a problem I'm encountering:

Example Data

df <- data.frame(1,2,3,4,5,6,7,8)
df <- rbind(df,df,df,df)

What I would like to do is find the p.value for the chisq.test of 1,2,3 vs. 4,5,6 in the data.frame defined above in the first row.

Let's try it flat out:

chisq.test(c(1,2,3),c(4,5,6))$p.value ## this works.

But when I try to do it by calling the columns/rows...

chisq.test(df[1,1:3],df[1,4:6])$p.value

Gives: Error in complete.cases(x, y) : not all arguments have the same length

Interesting, because that doesn't seem to be true:

length(df[1,1:3])
length(df[1,4:6])

Any thoughts on how to change the notation to get the desired result?

?chisq.test tells us:

Arguments:

       x: a numeric vector or matrix. ‘x’ and ‘y’ can also both be
          factors.

       y: a numeric vector; ignored if ‘x’ is a matrix.  If ‘x’ is a
          factor, ‘y’ should be a factor of the same length.

If we look at df as per your Q, the subsets you define are:

> is.numeric(df[1,1:3])
[1] FALSE
> is.vector(df[1,1:3])
[1] FALSE
> is.matrix(df[1,1:3])
[1] FALSE
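
To see what the subset actually is, a minimal sketch along these lines may help. It rebuilds the example data from the question; the name sub is only an illustrative variable, not something from the original post.

```r
# Rebuild the example data from the question.
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)

# 'sub' is a hypothetical name used only for this illustration.
sub <- df[1, 1:3]

class(sub)    # "data.frame": selecting one row keeps the data frame structure
dim(sub)      # 1 row, 3 columns
length(sub)   # 3, because length() of a data frame counts its columns
nrow(sub)     # 1, the number of observations (cases)
```

So length() returning 3 for both subsets does not mean they behave like vectors of three values.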

and the same for your other subset. What happens then is in the lap of the gods. Internally, because df[1,1:3] is a data frame, it is first converted to a one-row matrix, and thence to a vector:

Browse[2]> x ## here x is df[1,1:3]
[1] 1 2 3

whilst df[1,4:6] (y in the chisq.test function) is left untouched:

Browse[2]> y
  X4 X5 X6
1  4  5  6

and when the code calls complete.cases(x,y), we get the error you report:

Browse[2]> complete.cases(x, y)
Error in complete.cases(x, y) : not all arguments have the same length

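complete.cases calls internal code so we can't see what is going on, but essentially R thinks x and y are not of the same length, and this is because they are of different types: x is now a plain vector of three values, whereas y is still a data frame with a single row. (length() reports 3 for both of your subsets because the length of a data frame is its number of columns.)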
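
The mismatch can be reproduced outside chisq.test. The sketch below assumes that x ends up as a plain vector (mimicked here with as.matrix followed by as.vector) while y stays a one-row data frame; msg and ok are illustrative names.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)

x <- as.vector(as.matrix(df[1, 1:3]))  # assumed shape of x inside chisq.test
y <- df[1, 4:6]                        # still a one-row data frame

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    complete.cases(x, y)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# With both arguments as plain vectors, the number of cases agrees.
ok <- complete.cases(x, unlist(y))
ok
```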

@Prasad provides a workaround, namely unlisting the two data frames you supply to chisq.test into vectors.

However, the way you are using the function doesn't make much sense, to me at least. One would normally store the data in columns, rather than rows, of a data frame. It might not appear that there is a difference, but the columns of a data frame are its components, like the components of a list. Each individual component (column) is a discrete entity, a vector of data on the n observations in the data frame. If we transpose your df (and cast back to a data frame) to reflect a more natural data set-up:

> df2 <- data.frame(t(df))

then we can use the approach you did, but index the separate rows of the first column of df2 (rather than the separate columns of the first row of df) in the chisq.test call:

> chisq.test(df2[1:3,1], df2[4:6,1])

    Pearson's Chi-squared test

data:  df2[1:3, 1] and df2[4:6, 1] 
X-squared = 6, df = 4, p-value = 0.1991

Warning message:
In chisq.test(df2[1:3, 1], df2[4:6, 1]) :
  Chi-squared approximation may be incorrect
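
It may also help to look at what is being tested. When given two vectors, chisq.test treats them as factors and cross-tabulates them, so the call above amounts to a test on a 3 x 3 table with a single count in each diagonal cell. A small sketch (tab and res are illustrative names):

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# Cross-tabulate the two vectors, as chisq.test does for vector input.
tab <- table(x, y)
tab

# Testing the table directly gives the same statistic and degrees of freedom.
# suppressWarnings() only hides the small-count warning; it still applies.
res <- suppressWarnings(chisq.test(tab))
res$statistic
res$parameter
```

With expected counts this small, the warning about the chi-squared approximation should be taken seriously.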

This works, because R is able to drop the empty dimension in both subsets, so both inputs are vectors of the appropriate length:

> df2[1:3,1] ## drops the empty dimension!
[1] 1 2 3
> is.vector(df2[1:3,1])
[1] TRUE
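
The difference between the two kinds of subset can be sketched as follows; a, b and c1 are illustrative names, and drop = FALSE is the standard argument of `[` that switches off the dimension dropping.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)
df2 <- data.frame(t(df))

a  <- df2[1:3, 1]                # one column: dimension dropped, plain vector
b  <- df2[1:3, 1, drop = FALSE]  # drop = FALSE keeps a 3 x 1 data frame
c1 <- df[1, 1:3]                 # one row, several columns: still a data frame

is.vector(a)
is.data.frame(b)
is.data.frame(c1)
```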

Use unlist when you are extracting rows from the data frame:

> chisq.test(unlist(df[1,1:3]),unlist(df[1,4:6]))$p.value
[1] 0.1991483
Warning message:
In chisq.test(unlist(df[1, 1:3]), unlist(df[1, 4:6])) :
  Chi-squared approximation may be incorrect
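
If the same comparison is wanted for every row rather than just the first, one possible sketch is the following. The helper row_p_value is hypothetical, and the column ranges 1:3 and 4:6 are taken from the question.

```r
df <- data.frame(1, 2, 3, 4, 5, 6, 7, 8)
df <- rbind(df, df, df, df)

# Hypothetical helper: p-value for columns 1-3 vs. 4-6 of row i.
row_p_value <- function(i) {
  x <- unlist(df[i, 1:3])
  y <- unlist(df[i, 4:6])
  # suppressWarnings() hides the "approximation may be incorrect" warning
  # that these tiny counts trigger; the warning itself is still valid.
  suppressWarnings(chisq.test(x, y)$p.value)
}

# One p-value per row of df.
p_values <- vapply(seq_len(nrow(df)), row_p_value, numeric(1))
p_values
```

Since all four rows of the example data are identical, every entry should match the single p-value shown above.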