Loops inefficiency in R

Source: 开发者 (https://www.devze.com), 2023-01-04 18:19, reposted from the web
Good morning,

I have been developing for a few months in R and I have to make sure that the execution time of my code is not too long because I analyze big datasets.

Hence, I have been trying to use as many vectorized functions as possible.

However, I am still wondering about something.

What is costly in R is not the loop itself, right? I mean, the problem arises when you start modifying variables within the loop, for example. Is that correct?

So I was wondering: what if you simply have to run a function on each element and do not actually care about the result, for example to write data to a database? What should you do?

1) use mapply without storing the result anywhere?

2) do a loop over the vector and only apply f(i) to each element?

3) is there a better function I might have missed?

(that's of course assuming your function is not optimally vectorized).

What about the foreach package? Have you experienced any performance improvement by using it?


Just a couple of comments. A for loop is roughly as fast as apply and its variants. The real speed-ups come when you vectorise your function as much as possible, that is, when you rely on the low-level loops inside R's vectorised functions rather than on apply, which just hides the for loop. I'm not sure if this is the best example, but consider the following:

> n <- 1e06
> sinI <- rep(NA,n)
> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
  3.316   0.000   3.358 
> system.time(sinI <- sapply(1:n,sin))
   user  system elapsed 
  5.217   0.016   5.311 
> system.time(sinI <- unlist(lapply(1:n,sin),
+       recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.284   0.012   1.303 
> system.time(sinI <- sin(1:n))
   user  system elapsed 
  0.056   0.000   0.057 
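
As a minimal sketch, the same comparison can be written as a script that checks that all four approaches return the same values. It uses a smaller n so it runs quickly, and it preallocates with numeric(n) rather than rep(NA, n); that substitution is mine, made because rep(NA, n) is a logical vector that has to be coerced to double on the first assignment. Timings are left out because they depend on the machine and the R version.

```r
n <- 1e4  # smaller than the 1e6 used above, so the check is quick

# 1) explicit loop writing into a preallocated double vector
loop_res <- numeric(n)
for (i in seq_len(n)) loop_res[i] <- sin(i)

# 2) sapply, 3) lapply + unlist, 4) fully vectorised call
sapply_res <- sapply(seq_len(n), sin)
lapply_res <- unlist(lapply(seq_len(n), sin),
                     recursive = FALSE, use.names = FALSE)
vec_res    <- sin(seq_len(n))

# all four approaches should produce the same numbers
all_same <- isTRUE(all.equal(loop_res, sapply_res)) &&
  isTRUE(all.equal(loop_res, lapply_res)) &&
  isTRUE(all.equal(loop_res, vec_res))
print(all_same)
```

Wrapping any of the four expressions in system.time() reproduces the style of measurement shown above.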

In one of the comments below, Marek points out that the time-consuming part of the for loop above is actually the [<- part, that is, the element-wise assignment sinI[i] <- sin(i):

> system.time(sinI <- unlist(lapply(1:n,sin),
+       recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.284   0.012   1.303 
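
For the side-effect case raised in the question (call a function on each element and discard the result), here is a minimal sketch of the two obvious options. The function write_record and the environment log_env are hypothetical stand-ins for a real database write; they only count calls so that the example is runnable. Since nothing is stored per element, neither option pays for the [<- assignment discussed above.

```r
# Hypothetical stand-in for a side-effecting call such as a database write:
# it only counts how often it was called.
log_env <- new.env()
log_env$calls <- 0L
write_record <- function(x) {
  log_env$calls <- log_env$calls + 1L  # environments are modified in place
  invisible(NULL)
}

ids <- 1:5

# Option A: plain loop, no result stored
for (id in ids) write_record(id)

# Option B: lapply used only for its side effects;
# invisible() keeps the returned list of NULLs from being printed
invisible(lapply(ids, write_record))

print(log_env$calls)
```

Both options call the function once per element, so for a slow operation like a database write the cost of the call itself is likely to dominate; which form to use is then mostly a matter of readability.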

The bottlenecks which can't immediately be vectorised can be rewritten in C or Fortran, compiled with R CMD SHLIB, and then plugged in with .Call, .C or .Fortran.

Also, see these links for more info about loop optimisation in R, and check out the article "How Can I Avoid This Loop or Make It Faster?" in R News.


vapply avoids the post-processing by requiring you to specify up front the type and length of the value each call returns. In the timings below it turns out to be about 3.4 times faster than the for loop:

> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
   2.41    0.00    2.39 

> system.time(sinI <- unlist(lapply(1:n,sin), recursive = FALSE, use.names = FALSE))
   user  system elapsed 
   1.46    0.00    1.45 

> system.time(sinI <- vapply(1:n,sin, numeric(1)))
   user  system elapsed 
   0.71    0.00    0.69 
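
Besides speed, the declared return value also acts as a check. The sketch below illustrates this with a hypothetical function, mixed, that sometimes returns a string: sapply quietly changes its result type, while vapply stops with an error. It also shows the empty-input case, where sapply falls back to an empty list.

```r
x <- 1:3

# vapply checks every result against the template numeric(1)
ok <- vapply(x, sin, numeric(1))

# hypothetical function with an inconsistent return type
mixed <- function(i) if (i == 2) "two" else sin(i)

# sapply simplifies to a character vector without any warning
s <- sapply(x, mixed)

# vapply refuses, because one result is not a double of length 1
v <- tryCatch(vapply(x, mixed, numeric(1)),
              error = function(e) "vapply error")

# on empty input sapply returns list(), vapply returns numeric(0)
empty_s <- sapply(integer(0), sin)
empty_v <- vapply(integer(0), sin, numeric(1))

print(class(s))
print(v)
```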