I have a CSV file with timestamps and the event types that occurred at those times. I want to count the number of occurrences of each event type in 6-minute intervals.
The input-data looks like:
date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
I load and cure the data with this piece of code:
> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')
After curing the data looks like this:
> head(cured_data)
date type
1 2011-09-22 14:54:53 2
2 2011-09-22 14:54:53 2
3 2011-09-22 14:54:53 2
4 2011-09-22 14:54:53 2
5 2011-09-22 14:54:53 1
6 2011-09-22 14:54:53 1
I have read a lot of examples for xts and zoo, but somehow I can't get the hang of it. The output data should look something like:
date type count
2011-09-22 14:54:00 CEST 1 11
2011-09-22 14:54:00 CEST 2 19
2011-09-22 15:00:00 CEST 1 9
2011-09-22 15:00:00 CEST 2 12
2011-09-22 15:06:00 CEST 1 23
2011-09-22 15:06:00 CEST 2 18
Zoo's aggregate function looks promising, I found this code-snippet:
# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)
Now I'm just wondering how I could apply this on my use case.
Naive as I am I tried:
> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series
I must admit that I'm not really confident in R, but I try. :-)
I'm kinda lost. Could anyone point me in the right direction?
Thanks a lot! Cheers, Alex.
Here is the output of dput for a small subset of my data. The full data set is around 80 million rows.
structure(list(date = structure(c(1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""),
type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date",
"type"), row.names = c(NA, -23L), class = "data.frame")
We can read it using read.csv, convert the first column to a date time binned into 6 minute intervals and add a dummy column of 1's. Then re-read it using read.zoo, splitting on the type and aggregating on the dummy column:
# test data
Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'
library(zoo)
library(chron)
# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's
# and remove any leading space (removing space not needed if there is none)
DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))
# split and aggregate
z <- read.zoo(DF, split = 2, aggregate = length)
With the above test data the solution looks like this:
> z
2 3 4
(09/22/11 12:54:00) 4 3 1
Note that the above has been done in wide form since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4 so there are three columns.
(We have used chron here since its trunc method fits well with binning into 6 minute groups. chron does not support time zones, which can be an advantage since you can't make one of the many possible time zone errors, but if you want POSIXct anyway, convert it at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)) but that would result in a different time zone.)
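If the long layout from the question (one row per interval and type) is wanted after all, the wide series can be unrolled by hand. The following is only a sketch: the object z below is a hypothetical stand-in that mirrors the printed result above, and it uses a POSIXct index in UTC rather than chron so that the snippet runs on its own.

```r
library(zoo)

# Hypothetical stand-in for the wide result printed above:
# one 6-minute bin, one column per type (2, 3 and 4)
z <- zoo(matrix(c(4, 3, 1), nrow = 1,
                dimnames = list(NULL, c("2", "3", "4"))),
         order.by = as.POSIXct("2011-09-22 12:54:00", tz = "UTC"))

# Unroll column by column: repeat the time index once per column
# and repeat each column name once per row
long <- data.frame(
  date  = rep(time(z), times = ncol(z)),
  type  = rep(colnames(z), each = nrow(z)),
  count = as.vector(coredata(z)),
  stringsAsFactors = FALSE
)

# Sort so that rows for the same interval sit together
long <- long[order(long$date, long$type), ]
long
```

The long form is no longer a regular time series, so it is best produced as a last step, after any time-based processing on z is done.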
EDIT:
The original solution binned into dates but I noticed afterwards that you wish to bin into 6 minute periods so the solution was revised.
EDIT:
Revised based on comment.
You are almost all the way there. All you need to do now is create a zoo-ish version of that data and map it to the aggregate.zoo code. Since you want to categorize by both time and type, your second argument to aggregate.zoo must be a bit more complex, and since you want counts rather than means you should use length(). I do not think that count is a base R or zoo function, and the only count function I see in my workspace comes from pkg:plyr, so I don't know how well it would play with aggregate.zoo. length works as most people expect for vectors but often surprises people when applied to data.frames. If you do not get what you want with length, then you should see if NROW works instead (with your data layout they both succeed). With the new data object it is necessary to put the type argument first. It also turns out that aggregate.zoo only handles single-category classifiers, so you need as.vector to remove the zoo-ness:
with(cured_data,
aggregate(as.vector(x), list(type = type,
interval=as.factor(time(x) - as.numeric(time(x)) %% 360)),
FUN=NROW)
)
# interval x
#1 2011-09-22 09:24:00 12
#2 2011-09-22 09:24:00 11
This is an example modified from the source of your code (an example on SO by WizaRd Dirk), "Aggregate (count) occurences of values over arbitrary timeframe":
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(as.vector(x), by=list(cat=as.factor(x),
tms = as.factor(index(x) - as.numeric(index(x)) %% 600)), length)
cat tms x
1 1 1969-12-31 19:00:00 26
2 2 1969-12-31 19:00:00 22
3 3 1969-12-31 19:00:00 11
4 1 1969-12-31 19:10:00 17
5 2 1969-12-31 19:10:00 28
6 3 1969-12-31 19:10:00 15
7 1 1969-12-31 19:20:00 17
8 2 1969-12-31 19:20:00 16
9 3 1969-12-31 19:20:00 27
10 1 1969-12-31 19:30:00 8
11 2 1969-12-31 19:30:00 4
12 3 1969-12-31 19:30:00 9