I want to filter out all values of var3 < 5 while keeping at least one occurrence of each value of var1.
> foo <- data.frame(var1=c(1, 1, 8, 8, 5, 5, 5), var2=c(1,2,3,2,4,6,8), var3=c(7,1,1,1,1,1,6))
> foo
var1 var2 var3
1 1 1 7
2 1 2 1
3 8 3 1
4 8 2 1
5 5 4 1
6 5 6 1
7 5 8 6
subset(foo, (foo$var3>=5))
would remove rows 2 to 6, and I would lose var1==8.
- I want to remove the row if there is another value of var1 that fulfills the condition foo$var3 >= 5. See row 5.
- I want to keep the row, assigning NA to var2 and var3, if no occurrence of that value of var1 fulfills the condition foo$var3 >= 5.
This is the result I expect:
var1 var2 var3
1 1 1 7
3 8 NA NA
7 5 8 6
This is the closest I got:
> foo$var3[ foo$var3 < 5 ] = NA
> foo$var2[ is.na(foo$var3) ] = NA
> foo
var1 var2 var3
1 1 1 7
2 1 NA NA
3 8 NA NA
4 8 NA NA
5 5 NA NA
6 5 NA NA
7 5 8 6
Now I just need to know how to conditionally remove the right rows (2, 3 or 4, 5, 6): Remove the row if var2 & var3 are NA and if the value of var1 has more than 1 occurrence.
But there is surely a much simpler/elegant way to approach this little problem.
edit: changed foo to resemble my use case more
A quick way is to use merge:
> merge(foo[foo$var3>=5,],unique(foo$var1),by.x=1,by.y=1,all.y=TRUE)
var1 var2 var3
1 1 1 7
2 5 8 6
3 8 NA NA
unique(foo$var1) gives the unique values in var1. These are matched against the rows of the data frame that pass the var3 filter. by.x=1 and by.y=1 select the first column of each argument as the key, and all.y says that every value in y must appear in the result, with NA filled in where x has no matching row. See also ?merge.
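As a sketch of the same idea with the key made explicit, the unique values can be wrapped in a one-column data frame so that the join column is referred to by name rather than by position. The name keys is introduced here purely for illustration, and the threshold var3 >= 5 is taken from the question.

```r
foo <- data.frame(var1 = c(1, 1, 8, 8, 5, 5, 5),
                  var2 = c(1, 2, 3, 2, 4, 6, 8),
                  var3 = c(7, 1, 1, 1, 1, 1, 6))

# 'keys' is a hypothetical helper name: one row per distinct value of var1
keys <- data.frame(var1 = unique(foo$var1))

# Right outer join: every key survives; keys with no qualifying row get NA
res <- merge(foo[foo$var3 >= 5, ], keys, by = "var1", all.y = TRUE)
print(res)
```

Naming the key column avoids relying on the column name that merge would otherwise assign to a bare vector.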
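If you want to preserve the order in which the values first appear in var1, then:
> merge(foo[foo$var3>=5,],unique(foo$var1),by.x=1,by.y=1,
+ all.y=TRUE)[rank(unique(foo$var1)),]
var1 var2 var3
1 1 1 7
3 8 NA NA
2 5 8 6
merge sorts the result on the key column. rank gives the position of each unique value within that sorted result, so using it as a row index restores the original order. (For this data order happens to return the same indices, but in general it is rank that is needed.) This assumes at most one matching row per value of var1. See also ?rank.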
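For keys that appear in an arbitrary order, or groups with several qualifying rows, one possible sketch is to sort the merged result by each row's position in the vector of unique keys, using match. The data below is made up for illustration (it is not the foo from the question), and keys and m are hypothetical names.

```r
# Illustrative data in which the keys first appear in the order 8, 1, 5
foo <- data.frame(var1 = c(8, 8, 1, 1, 5),
                  var2 = c(3, 2, 1, 2, 4),
                  var3 = c(1, 1, 7, 1, 6))

keys <- unique(foo$var1)  # order of first appearance

# merge returns the rows sorted by var1 (1, 5, 8)
m <- merge(foo[foo$var3 >= 5, ], data.frame(var1 = keys),
           by = "var1", all.y = TRUE)

# match() gives each row's position in 'keys'; ordering by it restores that order
res <- m[order(match(m$var1, keys)), ]
print(res)
```

Because the ordering is computed from the merged rows themselves, it does not depend on there being exactly one row per key.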
After you do:
foo$var3[ foo$var3 < 5 ] = NA
foo$var2[ is.na(foo$var3) ] = NA
You need to remove rows containing NA that are also duplicate values of var1:
foo[!(!complete.cases(foo) & duplicated(foo$var1)), ]
Think of this line as identifying lines that contain NA values AND duplicate var1 values, then selecting everything else.
Edit: If the first row in a dataframe for a given value of var1 has a value of var3 that you want to exclude, my solution doesn't work. You'll need to order the data.frame first to make sure that the complete cases come first:
foo <- foo[order(foo$var2),] # ordering on var3 should be the same
foo[!(!complete.cases(foo) & duplicated(foo$var1)), ]
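If reordering the data frame is undesirable, a possible alternative sketch builds the row mask directly from the original data, without first overwriting values with NA. This is not the code above but a variation on it; the names ok, has_ok and keep are illustrative, and it assumes var3 contains no NA to begin with.

```r
foo <- data.frame(var1 = c(1, 1, 8, 8, 5, 5, 5),
                  var2 = c(1, 2, 3, 2, 4, 6, 8),
                  var3 = c(7, 1, 1, 1, 1, 1, 6))

ok     <- foo$var3 >= 5                 # rows that meet the condition
has_ok <- foo$var1 %in% foo$var1[ok]    # rows whose var1 has at least one such row

# Keep qualifying rows, plus the first row of every var1 that never qualifies
keep <- ok | (!has_ok & !duplicated(foo$var1))

res <- foo[keep, ]
res[!ok[keep], c("var2", "var3")] <- NA  # blank out the placeholder rows
print(res)
```

On the question's data this keeps rows 1, 3 and 7 in their original order, which matches the expected result shown in the question.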
rbind(r <- subset(foo, (foo$var3>=5)),
unique(transform(subset(foo, !var1%in%r$var1), var2=NA, var3=NA)))
step-by-step:
r <- subset(foo, (foo$var3>=5))
r2 <- subset(foo, !var1%in%r$var1) # rows whose var1 never meets the condition
r3 <- transform(r2, var2=NA, var3=NA) # replace var2 and var3 with NA
r4 <- unique(r3) # remove duplicates
rbind(r, r4) # bind them
Here's a way using the plyr package functions ddply and colwise, and the subset function. First define a helper function null2na:
null2na <- function(x) if ( length(x) == 0 ) NA else x
Next define the function filter that we want to apply to each sub-data-frame that has a specific value for var1:
filter <- function(df) cbind( data.frame( var1 = df[1,1]),
colwise(null2na) (subset(df, var3 >= 5)[,-1]))
Now do the ddply on foo by var1:
> ddply(foo, .(var1), filter)
var1 var2 var3
1 1 1 7
2 5 8 6
3 8 NA NA
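The same split-apply-combine idea can be sketched in base R for cases where plyr is not available. This is an assumed equivalent, not the answer's own code: keep_or_na is a hypothetical name, and split, lapply and do.call(rbind, ...) stand in for ddply.

```r
foo <- data.frame(var1 = c(1, 1, 8, 8, 5, 5, 5),
                  var2 = c(1, 2, 3, 2, 4, 6, 8),
                  var3 = c(7, 1, 1, 1, 1, 1, 6))

# Same helper as above: turn an empty vector into a single NA
null2na <- function(x) if (length(x) == 0) NA else x

# Hypothetical per-group function: qualifying rows, or one NA row if there are none
keep_or_na <- function(df) {
  hit <- df[df$var3 >= 5, , drop = FALSE]
  data.frame(var1 = df$var1[1],
             var2 = null2na(hit$var2),
             var3 = null2na(hit$var3))
}

# Split by var1, apply the function to each piece, and stack the results
res <- do.call(rbind, lapply(split(foo, foo$var1), keep_or_na))
rownames(res) <- NULL
print(res)
```

As with ddply, the groups come back sorted by var1 rather than in order of first appearance.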
Try this:
foo <- data.frame(var1= c(1, 1, 2, 3, 3, 4, 4, 5),
var2=c(9, 5, 13, 9, 12, 11, 13, 9),
var3=c(6, 8, 3, 6, 4, 7, 2, 9))
f2 <- foo[which(foo$var3 >= 5), ]
missing <- setdiff(foo$var1, f2$var1) # var1 values with no qualifying row, each listed once
f3 <- rbind(f2, data.frame(var1 = missing, var2 = rep(NA, length(missing)), var3 = rep(NA, length(missing))))
f3[order(f3$var1),]
The last line is only needed if you care about the order (assuming that the data is ordered on var1 in the first place).