I have the data frame:
data<-data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60, 60,100,100))
showing subject ID, day of event, duration of event, and observation period of the subject.
I want to transform the data set so that it shows the whole observation period for each subject (all days of observation), with zero as the duration for days on which no event was observed.
For the above dataset, the result would look something like this:
id day duration obs.period
A 1 0 60
A 2 0 60
A 3 0 60
A 4 0 60
A 5 12 60
A 6 1440 60
A 7 0 60
A 8 0 60
.
.
.
A 60 0 60
B 1 5 100
B 2 6 100
B 3 0 100
B 4 0 100
.
.
.
.
B 100 0 100
Any ideas?
Here's one approach using the plyr package. First, create a function to expand the data into the appropriate number of rows. Then, index into that new data.frame with the duration info from the original data. Finally, call this function with ddply() and group on the id variable.
library(plyr)
FUN <- function(x){
  # one row per day of this subject's observation period
  dat <- data.frame(
    id = x$id[1]
    , day = seq_len(x$obs.period[1])
    , duration = 0
    , obs.period = x$obs.period[1]
  )
  # fill in the observed durations, matching rows by day
  dat$duration[match(x$day, dat$day)] <- x$duration
  return(dat)
}
ddply(data, "id", FUN)
id day duration obs.period
1 A 1 0 60
2 A 2 0 60
3 A 3 0 60
4 A 4 0 60
5 A 5 12 60
6 A 6 1440 60
...
61 B 1 5 100
62 B 2 6 100
63 B 3 0 100
...
160 B 100 0 100
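If plyr is not available, the same split-apply-combine idea can probably be sketched in base R with split(), lapply() and rbind(). This is only an illustration on the example data from the question; expand_subject and expanded are made-up names, not part of any package.

```r
# Example data from the question
data <- data.frame(id = c("A", "A", "B", "B"),
                   day = c(5, 6, 1, 2),
                   duration = c(12, 1440, 5, 6),
                   obs.period = c(60, 60, 100, 100))

# expand_subject is a hypothetical helper: it builds one row per day of a
# subject's observation period, then fills in the observed durations
expand_subject <- function(x) {
  n <- x$obs.period[1]
  out <- data.frame(id = x$id[1], day = seq_len(n),
                    duration = 0, obs.period = n)
  out$duration[match(x$day, out$day)] <- x$duration
  out
}

# split by subject, expand each piece, and stack the pieces back together
expanded <- do.call(rbind, lapply(split(data, data$id), expand_subject))
rownames(expanded) <- NULL
```

Matching on day with match() avoids relying on vector recycling when comparing the expanded days against the observed ones.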
Create an empty data frame with the proper index columns, but no value columns, then merge it with your data and replace the NA's in the value columns with zeros.
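data <- data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60,60,100,100))
# one row per subject with its observation period
periods <- unique(data[, c("id", "obs.period")])
# index frame: every day from 1 to obs.period for each subject
zilch <- data.frame(id = rep(periods$id, periods$obs.period),
                    day = sequence(periods$obs.period),
                    obs.period = rep(periods$obs.period, periods$obs.period))
all <- merge(zilch, data, all.x = TRUE)
all$duration[is.na(all$duration)] <- 0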
I would first create a data frame to contain the results.
ob.period <- with(data, tapply(obs.period, id, max))
n <- sum(ob.period)
result <- data.frame(id=rep(names(ob.period), ob.period),
                     day=unlist(lapply(ob.period, function(a) 1:a)),
                     duration=rep(0, n),
                     obs.period=rep(ob.period,ob.period))
Then I would paste id and day together, use match to find the relevant rows in the larger data frame, and plug in the duration values.
idday.sm <- paste(data$id, data$day, sep=":")
idday.lg <- paste(result$id, result$day, sep=":")
result$duration[match(idday.sm, idday.lg)] <- data$duration
Here is another approach with plyr, which adds only the missing days to each subject's rows:
fill1 <- function(df) {
  # all days in this subject's observation period
  full_period <- seq_len(df[1,"obs.period"])
  to_fill <- setdiff(full_period, df$day)
  fill_id <- df[1,"id"]
  fill_dur <- 0
  fill_obs.p <- df[1,"obs.period"]
  rows_to_add <- data.frame(id=fill_id, day=to_fill, duration=fill_dur, obs.period=fill_obs.p)
  rbind(df,rows_to_add)
}
ddply(data, "id", fill1)
The result is not sorted by day within each id, however; plyr's arrange() can be used to sort it by id and day afterwards.
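Sorting by subject and day could also be sketched in base R with order(). The small data frame below is a made-up stand-in for the unsorted output, not the actual result of the call above.

```r
# unsorted is a made-up stand-in for rows that were appended out of order
unsorted <- data.frame(id = c("A", "A", "A", "B", "B", "B"),
                       day = c(2, 3, 1, 1, 3, 2),
                       duration = c(12, 0, 0, 5, 0, 6))

# order() with two keys sorts by id first, then by day within each id
sorted <- unsorted[order(unsorted$id, unsorted$day), ]
rownames(sorted) <- NULL
```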