R Grouping/Aggregation where the condition involves other rows in the table, not just the current row

Devze (https://www.devze.com), 2023-03-16 10:25. Source: web
Using R, what is the best way to aggregate rows on a condition that spans multiple rows? For example, aggregate any run of rows where z = 0 for n or more consecutive rows.

This is what the result would look like for the following sample table with n = 3.

Sample Table x:

x   y   z
0   0   6
5   5   0
40  2   0
4   0   0
10  0   1
0   0   2
11  7   0
0   4   0
0   0   0
0   0   0
0   0   2
18  0   4

Results Table:

x   y   z
0   0   6
49  7   0 <- Three rows from the sample table got aggregated
10  0   1
0   0   2
11  11  0 <- Four rows from the sample table got aggregated
0   0   2
18  0   4


Since it seems like you're still in the "leaRning phase", I thought an example using the plyr package would be helpful. plyr is an extremely handy library which allows you to slice/dice datasets and summarize their subgroups in a flexible (and terse -- as you'll see below) manner, so it would likely be worth your time to get to know. If you find yourself needing to do similar operations on extremely large data sets, you might also consider looking into the data.table package.

I'm assuming you've done Roman's textConnection trick to get your data into a data.frame named mmf. I'm adding an idx column to mmf so you can subset it and process the results group by group:

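library(plyr)
# mmf <- read.table(textConnection( ...
# Label each run of consecutive equal z values with its own index
rle.idx <- rle(mmf$z)
mmf$idx <- rep(seq_along(rle.idx$lengths), rle.idx$lengths)
# Sum every column within each run (this also sums the idx column itself)
ans <- ddply(mmf, .(idx), colwise(sum))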

And ans looks like:

 x  y z idx
 0  0 6   1
49  7 0   6
10  0 1   3
 0  0 2   4
11 11 0  20
 0  0 2   6
18  0 4   7

Just remove the idx column and you're done, eg:

ans <- ans[, -4]
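
Note that the rle index groups every run of equal z values, whatever the value or the run length, which happens to give the requested result for the sample data. As a minimal base R sketch of a variant that only collapses runs of at least n zeros: the helper name collapse_zero_runs is hypothetical (it is not part of plyr or of any answer here), and it assumes all columns are numeric with no NA in the test column.

```r
# Sample data from the question
mmf <- data.frame(
  x = c(0, 5, 40, 4, 10, 0, 11, 0, 0, 0, 0, 18),
  y = c(0, 5, 2, 0, 0, 0, 7, 4, 0, 0, 0, 0),
  z = c(6, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2, 4)
)

# Hypothetical helper: collapse only runs of at least n zeros in column coln
collapse_zero_runs <- function(dat, coln = "z", n = 3) {
  runs <- rle(dat[[coln]] == 0)
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  # TRUE for rows that sit inside a run of at least n zeros
  collapse <- rep(runs$values & runs$lengths >= n, runs$lengths)
  # A new group starts at every row that is not collapsed,
  # and at the first row of every collapsed run
  grp <- cumsum(!collapse | !duplicated(run_id))
  # rowsum() adds up the rows of each group, column by column
  out <- as.data.frame(rowsum(as.matrix(dat), grp))
  rownames(out) <- NULL
  out
}

res <- collapse_zero_runs(mmf, n = 3)
```

With n = 3 both zero runs are collapsed and res has seven rows; with n = 4 only the second run (four zeros) is long enough, so the first run is left untouched.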


This is the code I used to produce your result. If you have any questions, fire away.

mmf <- read.table(textConnection("x   y   z # read in your example data
0   0   6
5   5   0
40  2   0
4   0   0
10  0   1
0   0   2
11  7   0
0   4   0
0   0   0
0   0   0
0   0   2
18  0   4"), header = TRUE)

# find runs of identical values in the z column
mmf.rle <- rle(mmf$z)
mmf.rle <- data.frame(lengths = mmf.rle$lengths, values = mmf.rle$values)

merge.rows <- 3
# select runs of at least merge.rows consecutive zeros
mmf.zero <- which(mmf.rle$values == 0 & mmf.rle$lengths >= merge.rows)

for (i in mmf.zero) {
    # locate the rows of run i, sum them column-wise, and write the sum into the first row of the run
    m.mmf <- mmf.rle$lengths[1:i] # run lengths up to and including run i
    # the run starts one row after all earlier runs end, and ends at the total length so far
    select.rows <- (sum(head(m.mmf, -1)) + 1):sum(m.mmf)
    mmf.sum <- colSums(mmf[select.rows, ]) # column-wise sums over the rows of this zero run
    mmf[select.rows, ] <- NA # now that we have the sums, blank out the whole run...
    mmf[select.rows[1], ] <- mmf.sum # ... and insert the summed result into its first row
}

# remove any left over NA rows
mmf <- mmf[complete.cases(mmf),]
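
The row arithmetic inside the loop can also be done for all runs at once. This is only a sketch of the same idea, not a replacement for the code above: the cumulative sum of the run lengths gives the last row of each run, and subtracting the length and adding one gives the first row. The variable names are made up for illustration.

```r
# z column of the sample data
z <- c(6, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2, 4)

runs <- rle(z)
# Last and first row of every run
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only runs of at least three zeros
long_zero <- runs$values == 0 & runs$lengths >= 3
zero_runs <- data.frame(start = starts[long_zero], end = ends[long_zero])
```

For the sample data, zero_runs holds two rows: rows 2 to 4 and rows 7 to 10, the same ranges that select.rows takes in the two loop iterations.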


DATA

mmf <- read.table(textConnection("x   y   z
0   0   6
5   5   0
40  2   0
4   0   0
10  0   1
0   0   2
11  7   0
0   4   0
0   0   0
0   0   0
0   0   2
18  0   4"), header = TRUE) # read in your example data

CODE

agg_n <- function(dat=mmf,coln="z",n=3){
    agg <- function(.x) {
        # Length of the run of zeros that opens this group in column coln="z"
        zero <- .x[[coln]] == 0
        k <- if(all(zero)) nrow(.x) else which(!zero)[1] - 1
        # Sum the whole run if it holds at least n=3 records
        if(k >= n) {
            y <- rbind(colSums(.x[seq_len(k),]),.x[-seq_len(k),])
        } else y <- .x
        return(y)
    }
    # Groups of records starting with 0 in column coln="z"
    G <- cumsum(diff(c(0L,dat[[coln]] == 0))==1)
    new_dat <- do.call(rbind,lapply(split(dat,G),agg))
    return(new_dat)
}

OUTPUT

> agg_n()
      x  y z
0     0  0 6
1.1  49  7 0
1.5  10  0 1
1.6   0  0 2
2.1  11 11 0
2.11  0  0 2
2.12 18  0 4
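
To see how the grouping vector G in agg_n is built, here is a small sketch that spells out the intermediate steps on the z column of the sample data. The names is_zero and run_start are only for illustration.

```r
# z column of the sample data
z <- c(6, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2, 4)

# TRUE where z is zero
is_zero <- z == 0
# diff() of the 0/1 indicator is 1 exactly where a run of zeros starts
# (the leading 0L makes a zero in the very first row count as a start)
run_start <- diff(c(0L, is_zero)) == 1
# Counting the starts gives one group per zero run plus the rows that follow it
G <- cumsum(run_start)
```

Here G is 0 for the first row, 1 for rows 2 to 6 and 2 for rows 7 to 12, so each group after the first begins with a run of zeros, which is what agg relies on when it inspects the leading records of a group.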