I have a data frame with three columns: timestamp, key, event, which is ordered by time.
ts,key,event
3,12,1
8,49,1
12,42,1
46,12,-1
100,49,1
From this, I want to create a data frame with the timestamp and (number of unique keys - number of unique keys whose cumulative sum is 0 up to a given timestamp) divided by the number of unique keys up to the same timestamp. E.g. for the above example the result should be:
ts,prob
3,1
8,1
12,1
46,2/3
100,2/3
My initial step is to calculate the cumsum grouped by key:
library(plyr)  # provides ddply
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
sumByKey = ddply(items, .(key), transform, sum=cumsum(event))
In the second (and final) step I iterate over sumByKey with a for-loop and keep track of both all unique keys and all unique keys that have a 0 in their sum using vectors, e.g. if (!(key %in% uniqueKeys)) uniqueKeys = append(uniqueKeys, key). The prob is derived using the two vectors.
Initially, I tried to solve the second step using plyr, but I wanted to avoid re-calculating the unique keys up to a certain timestamp for each row in sumByKey. What I'm missing is a way to refer to external variables from a function passed to ddply or, alternatively (and more functional), to use an accumulator passed back into the function, e.g. function(acc, x) acc + x.
Is it possible to solve the second step in a better way, using e.g. ddply?
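For the accumulator idea, base R's Reduce with accumulate = TRUE passes the previous result back into the function. Below is a minimal sketch, assuming the items data frame from above; the step function and the layout of its state list (sums, prob) are made up for illustration and are not part of plyr.

```r
items <- data.frame(ts = c(3, 8, 12, 46, 100),
                    key = c(12, 49, 42, 12, 49),
                    event = c(1, 1, 1, -1, 1))

# Hypothetical step function: acc is the state after the previous row,
# i is the index of the current row.
step <- function(acc, i) {
  k <- as.character(items$key[i])
  prev <- if (k %in% names(acc$sums)) acc$sums[[k]] else 0
  acc$sums[k] <- prev + items$event[i]   # running sum per key seen so far
  acc$prob <- mean(acc$sums != 0)        # share of seen keys with non-zero sum
  acc
}

states <- Reduce(step, seq_len(nrow(items)),
                 accumulate = TRUE,
                 init = list(sums = numeric(0), prob = NA_real_))

# The first element is the initial state, so drop it.
prob <- sapply(states[-1], function(s) s$prob)
data.frame(ts = items$ts, prob = prob)
```

This copies the state on every step, so it illustrates the accumulator pattern rather than a fast solution.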
If my interpretation is right, then this should do it:
items = data.frame(ts=c(3,8,12,46,100), key=c(12,49,42,12,49), event=c(1,1,1,-1,1))
# numbers of keys that sum to zero, no ddply necessary
nzero <- cumsum(ave(items$event,items$key,FUN=cumsum)==0)
# number of unique keys at a given timepoint
nunique <- rep(FALSE,length(items$key))
nunique[match(unique(items$key),items$key)] <- TRUE
nunique <- cumsum(nunique)
# which gives:
items$p <- (nunique-nzero)/nunique
items
   ts key event         p
1   3  12     1 1.0000000
2   8  49     1 1.0000000
3  12  42     1 1.0000000
4  46  12    -1 0.6666667
5 100  49     1 0.6666667
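Note that cumsum(... == 0) counts every row on which a key's running sum is zero, so the count never goes down. If a key can become non-zero again later (an assumption, since the example data has no such case), a variant that tracks keys entering and leaving the zero state may be closer to what is wanted. A sketch, where prob_current is a made-up name and the last row of the data is hypothetical:

```r
items <- data.frame(ts = c(3, 8, 12, 46, 100, 120),
                    key = c(12, 49, 42, 12, 49, 12),
                    event = c(1, 1, 1, -1, 1, 1))  # last row added for illustration

prob_current <- function(d) {
  run   <- ave(d$event, d$key, FUN = cumsum)  # running sum per key
  prev  <- run - d$event                      # running sum before this row
  first <- !duplicated(d$key)                 # first occurrence of each key
  enter <- run == 0 & (first | prev != 0)     # key arrives at zero
  leave <- run != 0 & !first & prev == 0      # key moves away from zero
  nzero   <- cumsum(enter - leave)            # keys currently at zero
  nunique <- cumsum(first)                    # keys seen so far
  (nunique - nzero) / nunique
}

items$p <- prob_current(items)
items
```

On the original five rows this agrees with the result above; the two versions differ only once a key returns from zero.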
If your problem is only computational time, I bet the better idea will be to implement your algorithm as a C chunk; you may first use R to convert the keys to a contiguous range of integers (as.numeric(factor(...))) and then use a boolean array in C to obtain the number of unique keys easily and very fast. Remember that neither plyr nor the standard R *apply functions are significantly faster than loops (provided both are used without embarrassing errors, of course).
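The integer-coding idea can be tried out in plain R before moving anything to C. The following is only a sketch of that single-pass algorithm, written in R rather than C; prob_loop and its local variable names are made up for illustration.

```r
items <- data.frame(ts = c(3, 8, 12, 46, 100),
                    key = c(12, 49, 42, 12, 49),
                    event = c(1, 1, 1, -1, 1))

prob_loop <- function(d) {
  code  <- as.integer(factor(d$key))  # keys mapped to 1..K
  K     <- max(code)
  seen  <- logical(K)                 # boolean array: has key k appeared yet?
  total <- numeric(K)                 # running sum per key
  nseen <- 0
  nzero <- 0
  p <- numeric(nrow(d))
  for (i in seq_along(code)) {
    k <- code[i]
    if (!seen[k]) {
      seen[k] <- TRUE
      nseen <- nseen + 1
    } else if (total[k] == 0) {
      nzero <- nzero - 1              # key was counted as zero; re-check below
    }
    total[k] <- total[k] + d$event[i]
    if (total[k] == 0) nzero <- nzero + 1
    p[i] <- (nseen - nzero) / nseen
  }
  p
}

prob_loop(items)
```

Each row is handled with a constant number of array lookups, which is the property a C version would rely on as well.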