I got a transaction database which looks like this:
AccountID PaymentDate PaymentAmount
8 13 2020-02-09 1.00
9 13 2020-01-25 4.20
10 14 2020-01-01 30.68
11 14 2020-02-01 30.68
PaymentDate is in posix format. With transaction data I don't want to aggregate by time intervals (this is well documented), but by ID.
Using min() with Posix times gives the first day, max() the last day. This is the information I need for every ID.
Ok, here is what I tried:
# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length
# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge
# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'
Apparently, aggregate does not work with posix time if you need aggregates of time rather than aggregates by time.
But getting first and last transaction date must be possible?!
Reply
Ok, since I cannot answer my own question yet, I post it here:
Interesting. Thank you!
I typically use the as.is=T option with read.csv, then convert time using strptime. So when I look at the structure of my data I get:
$ PaymentDate : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...
To me, that does not look like a factor. I can use min() and max() on the whole column and it works. Apparently, POSIXlt is more troublesome than I thought. Coming from the POSIXlt, I did
data$PaymentDate=as.Date(data$PaymentDate)
Looking at the structure, the Class is correctly set as Date.
$ PaymentDate :Class 'Date' num [1:10000] 18296 18296 18296 18297 18297 ...
Now it appears to work. However, only ddply return the right format "2020-01-25", while both aggregate and summaryBy return in the "18286" format. Was that days since 1970-01-01? Well, I guess I can convert it back.
foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")
However, there has to be some explanation. Also, ddply is much slower.
Oh, why do I use strptime first? Well, the date in the original file is in a different format, "%d-%m-%y". Using as.Date directly on this did not appear to work.
EDIT
dput of my data
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L),
mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L,
3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L,
7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L),
year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L,
110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L,
109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L,
109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L,
109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L,
4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L,
2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L,
0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L,
273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L,
122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID",
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")
dput after doing as you suggested:
structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L,
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L,
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277,
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550,
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823,开发者_JAVA百科
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386,
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540,
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L,
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L,
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L,
52L, 53L, 54L, 55L, 40L), class = "data.frame")
dput of raw data
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010",
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010",
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009",
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009",
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009",
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010",
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009",
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010",
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")
The problem is that your data, and in particular the PaymentDate column, is a factor. Both your ddply
and aggregate
solutions will work as written if you first transform the PaymentDate column:
#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09",
"2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1,
4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
), class = "data.frame", row.names = c("8", "9", "10", "11"))
Change variable class to Date.
df$PaymentDate <- as.Date(df$PaymentDate)
Then run your original code. Using ddply:
ddply(df, .(AccountID), summarize, data=min(PaymentDate))
AccountID data
1 13 2020-01-25
2 14 2020-01-01
Using aggregate:
aggregate(PaymentDate ~ AccountID, df, min)
AccountID PaymentDate
1 13 2020-01-25
2 14 2020-01-01
There is another, more general way to avoid this problem. By default, when you create a data.frame by using read.table
(or its variants like read.csv
) the parameter stringsAsFactors
is set to TRUE
. When I recreate your data using stringsAsFactors=FALSE
, you don't need the intermediate step of transforming PaymentDate, and your code works as written:
dat <- " AccountID PaymentDate PaymentAmount
8 13 2020-02-09 1.00
9 13 2020-01-25 4.20
10 14 2020-01-01 30.68
11 14 2020-02-01 30.68 "
df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df
ddply(df, .(AccountID), summarize, data=min(PaymentDate))
AccountID data
1 13 2020-01-25
2 14 2020-01-01
精彩评论