I have a long data frame with three columns, fyear, tic, and dcvt (for fiscal year, ticker, and total convertible debt). There are about 18 fiscal years and a few thousand tickers. I would like to add an indicator variable that is one whenever dcvt goes up from one year to the next.
I tried ddply, but I lost the fyear column and wasn't sure how to get it back.
library(plyr)
temp <- data.frame(fyear = rep(1992:2009, 10), tic = rep(letters[1:10], each = 18), dcvt = rnorm(180, 200, 10))
my.fun <- function(x) x <- c(0, ifelse(tail(x, -1) - head(x, -1) > 0, 1, 0))
temp2 <- ddply(temp, "tic", colwise(my.fun, "dcvt"))
I also tried to cast to wide with the reshape2 package and then run for loops, but of course that took forever.
Is there a way that I can do this quickly? Should I make a wide zoo object and then use diff? I would like to avoid passing through a time series, if I can. Thanks!
Using transform in ddply sometimes helps greatly:
ddply(temp, .(tic), transform, dcvt=c(0, diff(dcvt)>0))
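Note that the call above overwrites dcvt with the indicator, and that diff() compares consecutive rows, so the result is only meaningful if each ticker's rows are in year order. As a minimal sketch of the same idea, here is a base R variant that uses ave() instead of ddply, sorts first, and writes the indicator to a new column. The column name dcvt.up is a hypothetical choice, not something from the question.

```r
# Sample data in the same shape as the question's temp
set.seed(1)
temp <- data.frame(fyear = rep(1992:2009, 10),
                   tic   = rep(letters[1:10], each = 18),
                   dcvt  = rnorm(180, 200, 10))

# diff() works on consecutive rows, so sort by ticker and year first
temp <- temp[order(temp$tic, temp$fyear), ]

# dcvt.up (hypothetical name): 1 if dcvt rose versus the previous year
# within the same ticker, 0 otherwise; the first year of each ticker is 0
temp$dcvt.up <- ave(temp$dcvt, temp$tic,
                    FUN = function(x) c(0, as.numeric(diff(x) > 0)))

head(temp)
```

Because ave() returns a vector aligned with the original rows, fyear, tic, and dcvt all stay in place.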
ddply() handles a data set of this size (10^2) quite well. However, for larger datasets and for situations where you don't necessarily need to return a full data frame, I would consider the following do.call + lapply solution:
my.fun <- function(cur.tic){
as.numeric(diff(temp$dcvt[temp$tic == cur.tic]) > 0)
}
do.call("c", lapply(unique(temp$tic), my.fun))
To demonstrate the performance payoffs (unfairly, given the vector vs. data frame issue), I took the OP's sample data, created new data frames of magnitude 10^4, 10^5, and 10^6, and then ran system.time() on @kohske's ddply solution and the solution above:
Original data (10^2):
> system.time(do.call("c", lapply(unique(temp$tic), my.fun)))
user system elapsed
0.000 0.000 0.003
> system.time(ddply(temp, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
user system elapsed
0.020 0.000 0.013
10^4 sample data:
> system.time(do.call("c", lapply(unique(temp.2$tic), my.fun)))
user system elapsed
0.000 0.000 0.002
> system.time(ddply(temp.2, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
user system elapsed
0.040 0.000 0.036
10^5 sample data:
> system.time(do.call("c", lapply(unique(temp.3$tic), my.fun)))
user system elapsed
0.000 0.000 0.004
> system.time(ddply(temp.3, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
user system elapsed
0.270 0.000 0.279
10^6 sample data:
> system.time(do.call("c", lapply(unique(temp.4$tic), my.fun)))
user system elapsed
0.010 0.000 0.018
> system.time(ddply(temp.4, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
user system elapsed
6.110 0.070 6.186
This is not a gripe about ddply(); rather, it is just an effort to share some code that I found useful while working on a very similar issue with a much larger dataset recently.
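One thing to watch when reproducing these timings: my.fun as written above subsets the global temp, so a call such as lapply(unique(temp.2$tic), my.fun) still reads temp rather than temp.2. A minimal sketch of a variant that takes the data frame as an argument follows. The helpers make.temp and up.by.tic are hypothetical names, and the way the larger frame is built is an assumption, since the post does not show how temp.2 through temp.4 were created.

```r
# make.temp (hypothetical): builds a test frame with n.tic tickers,
# one row per ticker and year, in the same shape as the question's temp
make.temp <- function(n.tic, years = 1992:2009) {
  data.frame(fyear = rep(years, n.tic),
             tic   = rep(seq_len(n.tic), each = length(years)),
             dcvt  = rnorm(n.tic * length(years), 200, 10))
}

# up.by.tic (hypothetical): same idea as my.fun, but the data frame is
# an argument and split() partitions dcvt by ticker in a single pass
up.by.tic <- function(df) {
  pieces <- split(df$dcvt, df$tic)
  # n - 1 values per ticker: 1 where dcvt rose versus the previous row
  unlist(lapply(pieces, function(x) as.numeric(diff(x) > 0)),
         use.names = FALSE)
}

set.seed(1)
big <- make.temp(5000)               # 5000 tickers x 18 years
system.time(res <- up.by.tic(big))   # timing depends on the machine
length(res)                          # 17 values for each of 5000 tickers
```

split() returns groups in sorted ticker order, whereas unique() keeps order of first appearance, so the two approaches line up only when the data are already sorted by ticker. Like my.fun, this returns one value fewer per ticker than there are rows, so it cannot be attached as a column without padding.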