开发者

Remove NA values from a vector

开发者 https://www.devze.com 2023-04-12 00:50 出处:网络
I have a huge vector which has a couple of NA values, and I\'m trying to find the max value in that vector (t开发者_StackOverflow中文版he vector is all numbers), but I can\'t do this because of the NA

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (t开发者_StackOverflow中文版he vector is all numbers), but I can't do this because of the NA values.

How can I remove the NA values so that I can compute the max?


Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)

Setting na.rm=TRUE does just what you're asking for:

d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)

If you do want to remove all of the NAs, use this idiom instead:

d <- d[!is.na(d)]

A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.


The na.omit function is what a lot of the regression routines use internally:

vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000


Use discard from purrr (works with lists and vectors).

discard(v, is.na) 

The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [:

v %>% discard(is.na)
v %>% `[`(!is.na(.))

Note that na.omit does not work on lists:

> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1

$b
[1] 2

$c
[1] NA


?max shows you that there is an extra parameter na.rm that you can set to TRUE.

Apart from that, if you really want to remove the NAs, just use something like:

myvec[!is.na(myvec)]


Just in case someone new to R wants a simplified answer to the original question

How can I remove NA values from a vector?

Here it is:

Assume you have a vector foo as follows:

foo = c(1:10, NA, 20:30)

running length(foo) gives 22.

nona_foo = foo[!is.na(foo)]

length(nona_foo) is 21, because the NA values have been removed.

Remember is.na(foo) returns a boolean matrix, so indexing foo with the opposite of this value will give you all the elements which are not NA.


You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.


I ran a quick benchmark comparing the two base approaches and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::dicard also - this turned out to be massively slower (though I'll happily take comments on my implementation & test!)

microbenchmark::microbenchmark(
  purrr::map(airquality,function(x) {x[!is.na(x)]}), 
  purrr::map(airquality,na.omit),
  purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
  times = 1e6)

Unit: microseconds
                                                     expr    min     lq      mean median      uq       max neval cld
 purrr::map(airquality, function(x) {     x[!is.na(x)] })   66.8   75.9  130.5643   86.2  131.80  541125.5 1e+06 a  
                          purrr::map(airquality, na.omit)   95.7  107.4  185.5108  129.3  190.50  534795.5 1e+06  b 
  purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06   c

For reference, here's the original test of x[!is.na(x)] vs na.omit:

microbenchmark::microbenchmark(
    purrr::map(airquality,function(x) {x[!is.na(x)]}), 
    purrr::map(airquality,na.omit), 
    times = 1000000)


Unit: microseconds
                                              expr  min   lq      mean median    uq      max neval cld
 map(airquality, function(x) {     x[!is.na(x)] }) 53.0 56.6  86.48231   58.1  64.8 414195.2 1e+06  a 
                          map(airquality, na.omit) 85.3 90.4 134.49964   92.5 104.9 348352.8 1e+06   b


Another option using complete.cases like this:

d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1]   1 100  10
max(output)
#> [1] 100

Created on 2022-08-26 with reprex v2.0.2

0

精彩评论

暂无评论...
验证码 换一张
取 消