Here is a sample:
> tmp
label value1 value2
1 aa_x_x xx xx
2 bc_x_x xx xx
3 aa_x_x xx xx
4 bc_x_x xx xx
How can I calculate the median of value1 for all repeated labels (or some other calculation on the corresponding values in the other data frame columns), taking into account only the first two letters of the label (i.e. "aa_1_1" and "aa_s_3" count as the same label)? The list of labels is finite and usable.
I have read about aggregate, %in%, subset and substr, but I have been unable to put together anything useful and simple.
Here is what I hope to get:
> tmp.result
label median1 some.calculation2
1 aa xx xx
2 bc xx xx
3 aa xx xx
4 bc xx xx
Thank you very much.
Have you tried making a new data frame, a copy of tmp that I'll call tmp2, with tmp2$label <- substr(tmp$label, 1, 2)? From there you can, for example, use tapply(tmp2$value1, tmp2$label, mean) to get the average of value1 aggregated over tmp2$label. Substitute median for mean to get the median instead.
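The substr approach above can be sketched in base R as follows. This is only a minimal illustration: the numeric values are made up (the question shows only "xx" placeholders), and the names key, per_group and median2 are hypothetical, not from the original post.

```r
# Hypothetical sample data; the numbers are invented for illustration.
tmp <- data.frame(
  label  = c("aa_1_1", "bc_2_1", "aa_s_3", "bc_x_x"),
  value1 = c(1, 2, 3, 6),
  value2 = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Grouping key: the first two characters of each label.
tmp$key <- substr(tmp$label, 1, 2)

# One row per key, with the median of each value column.
per_group <- aggregate(cbind(value1, value2) ~ key, data = tmp, FUN = median)

# One row per original row (the shape shown in the question):
# ave() repeats each group's median for every member of the group.
tmp.result <- data.frame(
  label   = tmp$key,
  median1 = ave(tmp$value1, tmp$key, FUN = median),
  median2 = ave(tmp$value2, tmp$key, FUN = median),
  stringsAsFactors = FALSE
)
```

Use per_group if one row per label is enough, or tmp.result if every original row should carry its group's median. Any other summary function can be passed as FUN in place of median.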
An option using dplyr
library(dplyr)
tmp %>%
group_by(label=sub('_.*$', '', label)) %>%
transmute(median1=median(value1), mean1=mean(value2))
Or data.table
library(data.table)
setDT(tmp)[, c('median1', 'mean1') := list(median(value1), mean(value2)),
           by = .(label = sub('_.*$', '', label))][, c(1, 4:5), with = FALSE]