R - Probability of date differences

Developer https://www.devze.com 2023-04-04 20:54 Source: web

Given df below, I want to get the time between consecutive requests, and then get a textual histogram of the probability that requests arrive 1 second apart, 2 seconds apart, 3 seconds apart, and so on up to 10 seconds. I want to use all of the data when calculating the probabilities, but I only want to see the first 10 seconds of the results.

I've tried to get help with this on the ML, but could not. I've received great help on here, so I hope I'm not abusing the help. This should be my last question. Thanks a lot.

df <- read.csv(textConnection('
"SOURCE","REQUEST_DATE"
"A","09/11/2011 09:28:48"
"A","09/11/2011 09:28:47"
"A","09/11/2011 09:15:42"
"A","09/11/2011 09:15:41"
"D","09/13/2011 09:06:53"
"D","09/13/2011 09:06:52"
"D","09/13/2011 08:56:55"
"D","09/13/2011 08:56:52"
"D","09/13/2011 08:55:43"
"D","09/13/2011 08:39:07"
'), stringsAsFactors=FALSE)

And here's how I'm getting the diff, with the excellent help of Andrie:

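library(plyr)  # for ddply(); REQUEST_DATE is parsed below because diff() cannot subtract strings (tz="UTC" is an assumption)
df_diff <- ddply(df, .(SOURCE), summarize, TIME_DIFF=-unclass(diff(as.POSIXct(REQUEST_DATE, format="%m/%d/%Y %H:%M:%S", tz="UTC"))))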

So, I want something like the following (with made-up results):

A 1 55%
A 2 15%
A 3 10%
...
A 10 5%
D 1 10%
D 2 12%
D 3 15%
...
D 10 1%

A row such as D 5013 2% would get cut off, because I only want to see the first 10 seconds for each source.

The "histogram as text" part is confusing me, but I am guessing you actually want to tabulate the differences in one-second bins:

 df_diff$tdiff_grp <- cut(df_diff$TIME_DIFF, 0:10, right=FALSE)
 with(df_diff, tapply(tdiff_grp, SOURCE, table))
$A
 [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10) 
     0      2      0      0      0      0      0      0      0      0 

$D
 [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10) 
     0      1      0      1      0      0      0      0      0      0

After you clarify what is actually desired, it would be a simple matter to use either prop.table or divide these by their sums (and then multiply by 100) to produce percentages.
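As a minimal sketch of the prop.table route, here is the idea on a made-up vector of gaps (the name `gaps` and its values are hypothetical, not taken from the question). One thing worth noticing: cut() returns NA for anything outside the 0:10 breaks, and table() drops those NAs by default, so the proportions are relative to the in-range gaps only.

```r
# Hypothetical gaps in seconds; 785 falls outside the 0:10 breaks
gaps <- c(1, 785, 1, 3)

# One-second bins, closed on the left: [0,1), [1,2), ..., [9,10)
grp <- cut(gaps, 0:10, right = FALSE)

# table() silently drops the NA produced for 785,
# so these percentages sum to 100 over the three in-range gaps
pct <- 100 * prop.table(table(grp))
print(round(pct, 1))
```

If the out-of-range gaps should count toward the total, the denominator has to be the number of all gaps rather than the table sum.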

EDIT: A simple function can return percentages:

> tbls <- with(df_diff, tapply(tdiff_grp, SOURCE,table))
> lapply(tbls, function(x) 100*x/sum(x) )
$A
 [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10) 
     0    100      0      0      0      0      0      0      0      0   

$D    
 [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10) 
     0     50      0     50      0      0      0      0      0      0
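Note that the percentages above are relative to the gaps that fall inside the first 10 seconds. If, as the question asks, every gap should count toward the probabilities while only the first 10 bins are displayed, one possible sketch in base R follows. It is not the method used above: the names `when`, `gaps` and `pct` are made up for illustration, tz = "UTC" is an assumption, and the timestamps are sorted before diff() instead of negating the result.

```r
# Sample data from the question
df <- data.frame(
  SOURCE = c(rep("A", 4), rep("D", 6)),
  REQUEST_DATE = c("09/11/2011 09:28:48", "09/11/2011 09:28:47",
                   "09/11/2011 09:15:42", "09/11/2011 09:15:41",
                   "09/13/2011 09:06:53", "09/13/2011 09:06:52",
                   "09/13/2011 08:56:55", "09/13/2011 08:56:52",
                   "09/13/2011 08:55:43", "09/13/2011 08:39:07"),
  stringsAsFactors = FALSE
)

# Parse the timestamps; tz = "UTC" is an assumption to sidestep DST jumps
df$when <- as.POSIXct(df$REQUEST_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Gaps per source, forced to seconds so the unit never switches to minutes
gaps <- lapply(split(df$when, df$SOURCE),
               function(x) as.numeric(diff(sort(x)), units = "secs"))

# Share of ALL gaps in each one-second bin, showing only bins below 10 s
pct <- lapply(gaps, function(g) {
  counts <- table(cut(g, 0:10, right = FALSE))  # gaps >= 10 s become NA and are dropped
  100 * counts / length(g)                      # denominator counts every gap
})

# One line per source and bin, e.g. "A [1,2) 66.7%"
for (src in names(pct)) {
  cat(sprintf("%s %s %.1f%%\n", src, names(pct[[src]]), as.numeric(pct[[src]])),
      sep = "")
}
```

With the sample data, two of A's three gaps are 1 second, so its [1,2) bin comes out at 66.7% rather than 100%. D's [1,2) and [3,4) bins each hold one of five gaps, so each comes out at 20% rather than 50%.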