I have a data.frame with 20 columns. The first two are factors, and the rest are numeric. I'd like to use the first two columns as split variables and then apply mean() to the remaining columns.
This seems like a quick and easy job for ddply(); however, the output data.frame is not what I am looking for. Here is a minimal example with just one column of data:
Aa <- c(rep(c("A", "a"), each = 20))
Bb <- c(rep(c("B", "b", "B", "b"), each = 10))
x <- runif(40)
df1 <- data.frame(Aa, Bb, x)
ddply(df1, .(Aa, Bb), mean)
The output is:
Aa Bb x
1 NA NA 0.5193275
2 NA NA 0.4491907
3 NA NA 0.4848128
4 NA NA 0.4717899
Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
The warning is repeated 8 times, presumably once for each call to mean(). I'm guessing this comes from trying to take the mean of a factor. I could write this as:
ddply(df1, .(Aa, Bb), function(df1) mean(df1$x))
or
ddply(df1, .(Aa, Bb), summarize, x = mean(x))
both of which work (no NAs), but I would rather avoid writing out 18 such x = mean(x) statements, one for each of my numeric columns.
Is there a general solution? I'm not wedded to ddply if there is a better answer elsewhere.
Since you are reducing the number of rows, you need to use summarise:
> ddply(df1, .(Aa, Bb), summarise, mean_x =mean(x) )
Aa Bb mean_x
1 a b 0.3790675
2 a B 0.4242922
3 A b 0.5622329
4 A B 0.4574471
It's just as easy to use aggregate in this instance. Let's say you had two numeric variables, x and y:
> aggregate(df1[-(1:2)], df1[1:2], mean)
Aa Bb x y
1 a b 0.4249121 0.4639192
2 A b 0.6127175 0.4639192
3 a B 0.4522292 0.4826715
4 A B 0.5201965 0.4826715
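The same aggregation can also be written with aggregate's formula interface, where the dot stands for every column not named on the right-hand side. This is only a minimal sketch: the second numeric column y and the fixed seed are assumptions added here for illustration, not part of the original data.

```r
# Rebuild the example data; y is a hypothetical second numeric column.
set.seed(42)
Aa <- rep(c("A", "a"), each = 20)
Bb <- rep(c("B", "b", "B", "b"), each = 10)
df1 <- data.frame(Aa, Bb, x = runif(40), y = runif(40))

# "." expands to all columns not used as grouping variables (here x and y).
res <- aggregate(. ~ Aa + Bb, data = df1, FUN = mean)
res
```

One thing to keep in mind: the formula method drops rows containing NA in any column by default (na.action = na.omit), so on data with missing values it may not match the column-by-column result.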
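ddply supports negative indexing as well. Use colMeans() inside the function, since calling mean() on a data frame returns NA with a warning in current versions of R:
ddply(df1, .(Aa, Bb), function(x) colMeans(x[-(1:2)]))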
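For the many-column case in the question, plyr also provides numcolwise(), which turns a function such as mean into one that is applied to every numeric column of each piece. A minimal sketch, assuming plyr is installed; the column y and the seed are made up for illustration:

```r
library(plyr)

# Example data with a hypothetical second numeric column y.
set.seed(42)
df1 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x  = runif(40),
  y  = runif(40)
)

# numcolwise(mean) skips the non-numeric split columns and averages the rest;
# ddply adds the Aa and Bb labels back to each result row.
res_plyr <- ddply(df1, .(Aa, Bb), numcolwise(mean))
res_plyr
```

This avoids both positional indexing and writing out one x = mean(x) term per column, so it should carry over to the 18 numeric columns unchanged.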