I’d like to ask you a question again. It is basically about data frames, NAs and tabulate function in [R].
I have this data frame. I already used this in a previous question of mine. It intentionally looks this simple, my real ’df’ dataframe is much bigger actually and again, I am not willing to annoy anyone with huge databases… So, my database:
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
df <-data.frame(id,a,b,c,d,e)
df
I have managed to calculate the distributions of the numbers occurring in columns ’b’ to ’e’ but considering the fact at the very same time that these distributions should be ’groupped by’ the id numbers in column ’id’. It works fine, check it ->
matrix(matrix(unlist(lapply(df[,(-(1))],
function(x) tapply(x,df$id,tabulate,
nbins=nlevels(factor(df[,2])))) [[1]])),
ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,3])))) [[2]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x)开发者_JAVA百科 tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4])))) [[3]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,5])))) [[4]])),ncol=3,nrow=3,byrow=TRUE)
matrix(matrix(unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,6])))) [[5]])),ncol=4,nrow=3,byrow=TRUE)
Now my problem is: what if my data frame contains NA values here and there and what if I want my in-built tabulate function to collect these NAs as well? So what if I want it to count how many occurrences I have from these NAs?
Here’s my modified data frame with the NAs:
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
df <-data.frame(id,a,b,c,d,e)
df
At first I tried something like this:
unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2],exclude=NULL)))) [[1]])
You see, the only thing I did was that I tried to apply this exclude=NULL
thing.
At least my code realizes the fact that I have 4 different levels in column a
(1,2,3,NA)
and not only three (1,2,3)
. Check it here:
nlevels(factor(df[,2], exclude=NULL))
But you see in the result that somehow it could not calculate the NAs. It says
3 0 6 0 4 3 3 0 4 1 5 0
Instead of the correct:
3 0 6 1 4 3 3 0 4 1 5 0
Or in case of:
unlist(lapply(df[,(-(1))],function(x) tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4],exclude=NULL)))) [[3]])
It says
2 4 4 0 2 3 4 0 1 5 4 0
Instead of the correct
2 4 4 0 2 3 4 1 1 5 4 0
etc.
Does someone have any ideas how to "persuade" the function tabulate to count NAs? Is it possible at all?
Thanks very much and have a pleasant weekend,
Laszlo
You can simplify your repeated calls to:
tabs <-lapply(df[,2:6], function(x, id){ t(table(x, id)) }, df$id)
which gives almost the same as your repeated matrix calls, e.g. for your first (non-NA) one:
> tabs[[1]]
x
id 1 2 3
1 3 0 7
2 4 3 3
3 4 1 5
So can we now modify this to deal with NA
? Yes, using the useNA
argument of the table()
function. Using your df
with NA
, we have:
tabs <-lapply(df[,2:6],
function(x, id){ t(table(x, id, useNA = "ifany")) }, df$id)
> tabs[[1]]
x
id 1 2 3 <NA>
1 3 0 6 1
2 4 3 3 0
3 4 1 5 0
Because we ask for NA
in the table only if an NA
exists, not all the tables in tabs
have the same number of columns. If that is important, we can change useNA = "ifany"
to be useNA = "always"
and all the result tables will have the same number of columns, however it adds another id row:
> tabs[[1]]
x
id 1 2 3 <NA>
1 3 0 6 1
2 4 3 3 0
3 4 1 5 0
<NA> 0 0 0 0
One final addition gets what we want - we use addNA()
to add an NA
level to each id
's set of numbers, even if there are no NA
s recorded:
tabs <-lapply(df[,2:6],
function(x, id){ t(table(addNA(x), id, useNA = "ifany")) }, df$id)
Which gives:
> tabs
$a
id 1 2 3 <NA>
1 3 0 6 1
2 4 3 3 0
3 4 1 5 0
$b
id 1 2 3 <NA>
1 8 1 1 0
2 6 3 1 0
3 2 4 4 0
$c
id 1 2 3 <NA>
1 2 4 4 0
2 2 3 4 1
3 1 5 4 0
$d
id 1 2 3 <NA>
1 2 3 5 0
2 2 6 2 0
3 5 3 2 0
$e
id 1 2 3 4 <NA>
1 4 3 3 0 0
2 4 2 4 0 0
3 1 3 4 1 1
Can't you just use is.na
? If you want to count up the number of entries that are NA or non-zero, you could sum(is.na(my.var)|my.var>0)
.
精彩评论