开发者

Last Observation Carried Forward In a data frame? [duplicate]

开发者 https://www.devze.com 2022-12-29 20:50 出处:网络
This question already has answers here: Replacing NAs with latest non-NA value (21 answers) Closed 4 years ago.
This question already has answers here: Replacing NAs with latest non-NA value (21 answers) Closed 4 years ago. 开发者_如何转开发

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.

Here is a simple code to do it (question after it):

LOCF <- function(x)
{
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}


# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))

Now this works great for simple vectors. But if I where to try and use it on a data frame:

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess

It will turn my data frame into a character matrix.

Can you think of a way to do LOCF on a data.frame, without turning it into a matrix? (I could use loops and such to correct the mess, but would love for a more elegant solution)


This already exists:

library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))


If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.

na.locf <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v)+1]
}


Adding the new tidyr::fill() function for carrying forward the last observation in a column to fill in NAs:

a <- data.frame(col1 = rep("a",4), col2 = 1:4, 
                col3 = 1:4, col4 = c(1,NA,NA,NA))
a
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2   NA
# 3    a    3    3   NA
# 4    a    4    4   NA

a %>% tidyr::fill(col4)
#   col1 col2 col3 col4
# 1    a    1    1    1
# 2    a    2    2    1
# 3    a    3    3    1
# 4    a    4    4    1


There are a bunch of packages implementing exactly this functionality. (with same basic functionality, but some differences in additional options)

  • spacetime::na.locf
  • imputeTS::na_locf
  • zoo::na.locf
  • xts::na.locf
  • tidyr::fill

Added a benchmark of these methods for @Alex:

I used the microbenchmark package and the tsNH4 time series, which has 4552 observations. These are the results:

Last Observation Carried Forward In a data frame? [duplicate]

So for this case na_locf from imputeTS was the fastest - closely followed by na.locf0 from zoo. The other methods were significantly slower. But be careful it is only a benchmark made with one specific time series. (added the code that you can test for your specific use case)

Results as a plot:

Last Observation Carried Forward In a data frame? [duplicate]

Here is the code, if you want to recreate the benchmark with a self selected time series:

library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)

# Create a data.frame from tsNH series 
df <- as.data.frame(tsNH4)

res <- microbenchmark(imputeTS::na_locf(tsNH4),
                    zoo::na.locf0(tsNH4),
                    zoo::na.locf(tsNH4), 
                    tidyr::fill(df, everything()), 
                    spacetime::na.locf(tsNH4), 
                    times = 100)
ggplot2::autoplot(res)

plot(res)

# code just to show each methods produces correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())


This question is old but for posterity... the best solution is to use data.table package with the roll=T.


I ended up solving this using a loop:

fillInTheBlanks <- function(S) {
  L <- !is.na(S)
  c(S[L][1], S[L])[cumsum(L)+1]
}


LOCF.DF <- function(xx)
{
    # won't work well if the first observation is NA

    orig.class <- lapply(xx, class)

    new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))

    for(i in seq_along(orig.class))
    {
        if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
        if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
        if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])   
    }

    #t(na.locf(t(a)))

    return(new.xx)
}

a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)


Instead of apply() you can use lapply() and then transform the resulting list to data.frame.

LOCF <- function(x) {
    # Last Observation Carried Forward (for a left to right series)
    LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
    x[LOCF:length(x)] <- x[LOCF]
    return(x)
}

a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号