Or rather, How can I do it better than I have fudged.
I have a dataframe with names and start and end dates in a group. I want to produce a dataframe with the number of people in the group over time. Note, some people haven't left yet (end date is NA)
Here's an example dataset
foo<-data.frame(name=c("Bob","Sue", "Richard", "Jane"),
Start=as.POSIXct(c("2006-03-23 GMT", "2007-01-20 GMT", "2007-01-20 GMT", "2006-03-23 GMT")),
End=as.POSIXct(c("2009-01-20 GMT", "NA", "2006-03-23 GMT", "NA")))
Here I create a dataframe with dates covering the range I want. This feels very dirty.
daterange<-data.frame(date=as.POSIXct(
paste(
rep(2006:2009, each=12),
"-",
rep(01:12, times=4),
"-",
1,
" GMT",
sep="")
)
)
#cheat by setting NAs to soemthing far away
foo$End[is.na(foo$End)]<-as.POSIXct(Sys.time())+(365*24*60*60)
Now I use ddply to produce the result.
ddply(.data=daterange, .variable="date", function(df) {
开发者_运维问答 result=nrow(subset(foo, Start<df$date & End>df$date))
return(result)
})
There must be an easier way ?
Here is an alternate approach using plyr
. It directly works with the original data frame foo
and does not require converting NA
into a date. The code is self-explanatory and very readable. Any comments are welcome.
dates = seq(as.POSIXct('2006-01-01'), as.POSIXct('2009-12-01'), by = "month")
count = ldply(dates, function(d)
with(foo, sum((Start < d) + (d < End | is.na(End)) == 2)))
data.frame(dates, count)
Here's a different way that you might find easier:
foo<-data.frame(name=c("Bob","Sue", "Richard", "Jane"),
Start=as.POSIXct(c("2006-03-23 GMT", "2007-01-20 GMT", "2007-01-20 GMT", "2006-03-23 GMT")),
End=as.POSIXct(c("2009-01-20 GMT", NA, "2006-03-23 GMT", NA)))
tmp <- expand.grid(foo$name,seq.POSIXt(as.POSIXct('2006-01-01'),
as.POSIXct('2009-12-01'),by="month"))
colnames(tmp) <- c('name','date')
foo[is.na(foo)] <- max(tmp$date) + 1
tmp1 <- merge(tmp,foo,by="name")
tmp2 <- tmp1$Start <= tmp1$date & tmp1$End >= tmp1$date
aggregate(tmp2,by=list(date=tmp1$date),sum)
My two cents here are to use seq.*
rather than paste
ing dates together and that ddply
is kind of an awkward tool if you're really just taking daterange one element at a time. I used aggregate
, but you could have used lapply
or something like that.
You could compress this into fewer lines if you really want to, but readability will suffer.
精彩评论