In R, what exactly is the problem with having variables with the same name as base R functions?

It seems to be generally considered poor programming practice to give variables the same names as functions in base R.

For example, it is tempting to write:

data <- data.frame(...)
df   <- data.frame(...)

But data is a function that loads data sets, and df is a function that computes the density of the F distribution.

Similarly, it is tempting to write:

a <- 1
b <- 2
c <- 3

This is considered bad form because the function c will combine its arguments.

But: in that workhorse of R functions, lm, which fits linear models, data is used as an argument. In other words, data becomes an explicit variable inside the lm function.

So: If the R core team can use identical names for variables and functions, what stops us mere mortals?

The answer is not that R will get confused. Try the following example, where I explicitly assign a variable with the name c. R doesn't get confused at all with the difference between variable and function:

c("A", "B")
[1] "A" "B"

c <- c("Some text", "Second", "Third")
c(1, 3, 5)
[1] 1 3 5

c[3]
[1] "Third"

The question: What exactly is the problem with having a variable with the same name as a base R function?

There isn't really one. When looking for a function, R normally skips objects that are not functions:

> mean(1:10)
[1] 5.5
> mean <- 1
> mean(1:10)
[1] 5.5
> rm(mean)
> mean(1:10)
[1] 5.5
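The two kinds of lookup can be put side by side. This is only a minimal sketch: lookup_demo is a made-up name, and local() is used so the variable mean does not leak into your workspace. Asking get() for mode = "function" mimics what R does when a name appears in call position.

```r
lookup_demo <- local({
  mean <- 1  # a plain variable that shares its name with base::mean
  list(
    # Ordinary value lookup finds the local variable first.
    as_value    = get("mean", envir = environment()),
    # Function lookup skips non-function bindings and keeps searching.
    as_function = get("mean", envir = environment(), mode = "function"),
    # A call uses function lookup, so base::mean is still what runs.
    call_result = mean(1:10)
  )
})

is.function(lookup_demo$as_value)     # the variable, not a function
is.function(lookup_demo$as_function)  # the base function
lookup_demo$call_result
```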

The examples shown by @Joris and @Sacha are where poor coding catches you out. One better way to write foo is:

foo <- function(x, fun) {
    fun <- match.fun(fun)
    fun(x)
}

Which when used gives:

> foo(1:10, mean)
[1] 5.5
> mean <- 1
> foo(1:10, mean)
[1] 5.5

There are situations where this will catch you out, and @Joris's example with na.omit is one, which IIRC, is happening because of the standard, non-standard evaluation used in lm().

Several Answers have also conflated the T vs TRUE issue with the masking of functions issue. As T and TRUE are not functions that is a little outside the scope of @Andrie's Question.

The problem is not so much the computer as the user. In general, code can become a lot harder to debug. Typos are made very easily, so if you do:

c <- c("Some text", "Second", "Third")
c[3]
c(3)

You get the correct results. But if you slip somewhere in your code and type c(3) instead of c[3], finding the error will not be that easy.

The scoping can also lead to very confusing error reports. Take the following flawed function:

my.foo <- function(x){
    if(x) c <- 1
    c + 1
}

> my.foo(TRUE)
[1] 2
> my.foo(FALSE)
Error in c + 1 : non-numeric argument to binary operator

With more complex functions, this can lead you on a debugging trail leading nowhere. If you replace c in the above function with a name that is not also a function, say z, the error will read "object 'z' not found". That will lead you to your coding error a lot faster.
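For comparison, here is a sketch of the same flawed function with a variable name that does not collide with any function. The names my_foo_renamed and total are hypothetical; tryCatch() is only there to capture the error message instead of stopping the script.

```r
my_foo_renamed <- function(x) {
  if (x) total <- 1   # total is only created when x is TRUE
  total + 1
}

ok_result <- my_foo_renamed(TRUE)

# With x = FALSE the variable was never created, and the error says so directly.
error_message <- tryCatch(
  my_foo_renamed(FALSE),
  error = function(e) conditionMessage(e)
)
error_message
```

The message now names the missing object, which points straight at the forgotten assignment.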

Besides that, it can lead to rather confusing code. Code like c(c+c(a,b,c)) asks more of the brain than c(d+c(a,b,d)). Again, this is a trivial example, but it can make a difference.

And obviously, you can get errors too. When you expect a function, you won't get one, which can give rise to another set of annoying bugs:

my.foo <- function(x,fun) fun(x)
my.foo(1,sum)
[1] 1
my.foo(1,c)
Error in my.foo(1, c) : could not find function "fun"

A more realistic (and real-life) example of how this can cause trouble:

x <- c(1:10,NA)
y <- c(NA,1:10)
lm(x~y,na.action=na.omit)
# ... correct output ...
na.omit <- TRUE
lm(x~y,na.action=na.omit)
Error in model.frame.default(formula = x ~ y, na.action = na.omit, 
drop.unused.levels = TRUE) : attempt to apply non-function

Try figuring out what's wrong here if na.omit <- TRUE occurs 50 lines up in your code...
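If you do end up in this situation, two ways out are sketched below. Treat this as an illustration rather than a recommendation: it assumes the masking variable lives in the environment you are calling lm() from, and the names masked_error, fit and fit_after_rm are made up for the example.

```r
x <- c(1:10, NA)
y <- c(NA, 1:10)
na.omit <- TRUE  # the accidental masking from the example above

# The unqualified name now evaluates to TRUE, so lm() fails.
masked_error <- tryCatch(
  lm(x ~ y, na.action = na.omit),
  error = function(e) conditionMessage(e)
)

# Option 1: qualify the name with its namespace, which bypasses the variable.
fit <- lm(x ~ y, na.action = stats::na.omit)

# Option 2: remove the offending variable so the plain name works again.
rm(na.omit)
fit_after_rm <- lm(x ~ y, na.action = na.omit)

all.equal(coef(fit), coef(fit_after_rm))
```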

Answer edited after comment of @Andrie to include the example of confusing error reports

R is very robust to this, but you can think of ways to break it. For example, consider this function:

foo <- function(x,fun) fun(x)

Which simply applies fun to x. Not the prettiest way to do this, but you might encounter it in someone's script. This works for mean():

> foo(1:10,mean)
[1] 5.5

But if I assign a new value to mean it breaks:

mean <- 1
foo(1:10,mean)

Error in foo(1:10, mean) : could not find function "fun"

This will happen very rarely, but it might happen. It is also very confusing for people if the same name means two things:

mean(mean)

Since it is trivial to use any other name you want, why not use a name different from those of base R functions? Also, for some R objects this becomes even more important. Think of reassigning the '+' function! Another good example is the reassignment of T and F, which can break so many scripts.
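The T and F case is easy to demonstrate. T and F are ordinary variables bound to TRUE and FALSE, whereas TRUE and FALSE themselves are reserved words. A small sketch follows (shorthand_demo and reserved_is_protected are hypothetical names, and local() keeps the reassignment out of your workspace):

```r
shorthand_demo <- local({
  before <- isTRUE(T)  # T starts out as a variable bound to TRUE
  T <- 0               # a stray assignment silently changes its meaning
  after <- isTRUE(T)
  c(before = before, after = after)
})
shorthand_demo

# TRUE is a reserved word, so the same assignment is rejected with an error.
reserved_is_protected <- inherits(
  tryCatch(eval(parse(text = "TRUE <- 0")), error = function(e) e),
  "error"
)
reserved_is_protected
```

This is why code that spells out TRUE and FALSE is not affected by somebody's stray T <- 0.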

I think the problem arises when people do this in the global environment, where it can cause frustration through unexpected errors. Imagine you have just run a reproducible example (maybe a pretty lengthy one) that overwrote one of the functions used in your simulation; the simulation takes ages to get to where you want it and then suddenly breaks down with a funny error. Variables that reuse existing function names inside a closed environment (like a function) are removed when the function exits and should not cause harm, assuming the programmer is aware of all the consequences of such behavior.
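A minimal sketch of that last point, with a made-up function describe(): the variables mean and range exist only while the function runs, so nothing is masked for the caller afterwards.

```r
describe <- function(values) {
  # Local variables named like base functions; they vanish when the call returns.
  mean  <- sum(values) / length(values)
  range <- max(values) - min(values)
  c(mean = mean, range = range)
}

result <- describe(1:10)
result

# Outside the function, both names still refer to the base functions.
still_functions <- is.function(mean) && is.function(range)
still_functions
```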

The answer is simple. Well, kind of.

The bottom line is that you should avoid confusion. Technically there is no reason to give your variables proper names, but it makes your code easier to read.

Imagine having a line of code containing something like data()[1] or similar (this line probably doesn't make sense, but it's only an example): although it is clear to you now that you're using the function data here, a reader who noticed a data.frame named data nearby may be confused.

And if you're not altruistically inclined, remember that the reader could be you in half a year, trying to figure out what you were doing with 'that old code'.

Take it from a man who has learned to use long variable names and naming conventions: it pays back!

I agree with @Gavin Simpson and @Nick Sabbe that there is not really a problem, but that this is more a question of code readability. Hence, as with many things in life, it is a question of convention and consensus.

And I think it is a good convention to give the general advice: Do not name your variables like base R functions!

This advice works like other good advice. For example, we all know that we should not drink too much booze or eat too much unhealthy food, but from time to time we fail to follow that advice and get drunk while eating too much junk food.

The same is true for this advice. It obviously makes sense to name the data argument data. It makes a lot less sense to name a data vector mean, although there may be situations in which even this seems appropriate. But try to avoid those situations for clarity.

While some languages might allow it (IF IF THEN THEN ELSE ELSE comes to mind), in general it is considered very poor practice. It's not that we don't want to give you the opportunity to show off your advanced knowledge of the language; it's that one day we will have to deal with that code, and we are but mortals.

So save your programming tricks from breaking the nightly builds and give your variables reasonable names, with consistent casing if you are feeling extra warm and fuzzy.