First without the details
I have data.frames like this one:
val1 val2 val3 val4 val5
1 1.1 2 1.1 2.1 4.2
2 5.7 5 5.6 4.9 9.9
3 3.1 3 3.2 2.9 5.9
4 9.6 1 9.5 1.0 2.0
and want to find the (nearly) equal columns. The desired result would be something like
[1] "val1" "val2" "val5"
because the column val3 is almost equal to val1, val4 is almost equal to val2, and val5 is different.
Details:
- What does "nearly" equal mean (just one of the options listed below):
- the absolute difference of the values is smaller than a fixed number (0.2 for the sample above)
- the relative difference of the values is smaller than a fixed number (~11% for the sample)
- other metrics which make sense ;-)
- a listing of linearly dependent columns would be even better (but I think that's way more complicated); that would mean that val5 is also part of the group formed by val2 and val4, since it's roughly twice the value
- it does not have to be really fast, O(n^2) would be okay (my frames are only about 12 rows and 300 columns)
- if that should not be possible, a list of exactly equal columns would somehow work, too. Then I would apply the round() function beforehand.
It's not quite well-defined how to choose which columns are equal; for instance, you could have three columns where A and B are "equal" and B and C are "equal" but A and C are not. What to do then? One way around that might be to use hierarchical clustering, maybe like this:
Using the data from Andrie's answer (assumed here to be stored in d), first transpose it and convert it to a matrix. I'll also standardize each row (what was a column) as a first step toward finding linear combinations; this will group rows that are exact multiples of each other, but not more complex combinations.
d <- t(as.matrix(d))
s <- rowSums(d)
ds <- sweep(d, 1, s, `/`)
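To see why dividing by the row sums helps, here is a small check with made-up vectors (a and b are illustrative names, not part of the data above). A vector and an exact positive multiple of it standardize to the same values, so the distance between them becomes zero:

```r
a <- c(2, 5, 3, 1)   # illustrative values
b <- 2 * a           # an exact multiple of a

# Standardize each vector by its own sum, as sweep() does per row above
sa <- a / sum(a)
sb <- b / sum(b)

# all.equal() compares with a small numeric tolerance
same_after_scaling <- isTRUE(all.equal(sa, sb))
print(same_after_scaling)
```

Columns that are only roughly multiples of each other, like val2 and val5 in the sample, end up close together rather than identical, which is what the clustering step then picks up.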
We now make a tree, and for interest, plot it. This uses the default distance function (Euclidean) but others are possible.
tree <- hclust(dist(ds))
plot(tree)
We then choose where to cut the tree into groups (this is where you choose how close two columns have to be to count as "equal"); I output it together with the sum of values to see if any are multiples of another.
> grp <- cutree(tree, h=0.1)
> cbind(grp, s)
grp s
val1 1 19.5
val2 2 11.0
val3 1 19.4
val4 2 10.9
val5 2 22.0
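As a variation on the clustering above, the absolute-difference criterion from the question can probably be expressed with the "maximum" distance, which is the largest absolute difference between two columns. The following is a sketch under that assumption, using the unstandardized sample data and the 0.2 threshold from the question; the names d_max, grp_abs and reps are made up for this example.

```r
# Sample data from the question
x <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2, 5, 3, 1),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# dist() compares rows, so transpose to compare the original columns;
# "maximum" gives the largest absolute difference between two columns
d_max <- dist(t(as.matrix(x)), method = "maximum")

# Cut the tree at the tolerance (0.2 is the threshold from the question)
grp_abs <- cutree(hclust(d_max), h = 0.2)

# Keep the first column of each group as its representative
reps <- names(grp_abs)[!duplicated(grp_abs)]
print(reps)
```

With hclust's default complete linkage, every pair of columns inside a group is within the cut height of each other, which sidesteps the A/B/C ambiguity mentioned at the start. Here val5 stays in its own group because the data is not standardized.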
Replicate your data:
x <- structure(list(val1 = c(1.1, 5.7, 3.1, 9.6), val2 = c(2L, 5L,
3L, 1L), val3 = c(1.1, 5.6, 3.2, 9.5), val4 = c(2.1, 4.9, 2.9,
1), val5 = c(4.2, 9.9, 5.9, 2)), .Names = c("val1", "val2", "val3",
"val4", "val5"), class = "data.frame", row.names = c("1", "2",
"3", "4"))
x
val1 val2 val3 val4 val5
1 1.1 2 1.1 2.1 4.2
2 5.7 5 5.6 4.9 9.9
3 3.1 3 3.2 2.9 5.9
4 9.6 1 9.5 1.0 2.0
Create a function. The mechanism is to wrap around the base R function duplicated, which has a method for arrays that also handles columns, unlike the method for data.frames, which only handles rows. Also, I took you at your word and round each column, but you can specify the number of digits as a parameter.
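not_duplicated <- function(x, round_digits, margin = 2) {
  # round() is vectorized; as.matrix() lets data.frames through as well
  x2 <- round(as.matrix(x), round_digits)
  # duplicated() on a matrix compares whole rows (MARGIN = 1) or columns (MARGIN = 2)
  dimnames(x2)[[margin]][!duplicated(x2, MARGIN = margin)]
}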
The results are as you specified:
x <- as.matrix(x)
not_duplicated(x, 0)
[1] "val1" "val2" "val5"
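One caveat with rounding is that close values can land on different sides of a rounding boundary: 1.45 and 1.55 differ by only 0.1 but round to 1 and 2. If that matters, a direct pairwise comparison with a tolerance might be an alternative. The sketch below is not part of the approach above; distinct_columns and its tol argument are hypothetical names, and it keeps a column only if it is not within tol of a column that was already kept.

```r
# Hypothetical helper: greedy O(n^2) scan over the columns
distinct_columns <- function(x, tol = 0.2) {
  m <- as.matrix(x)
  keep <- integer(0)
  for (j in seq_len(ncol(m))) {
    # TRUE for each kept column whose largest absolute difference to column j is below tol
    is_close <- vapply(keep,
                       function(k) max(abs(m[, j] - m[, k])) < tol,
                       logical(1))
    if (!any(is_close)) keep <- c(keep, j)
  }
  colnames(m)[keep]
}

# Sample data from the question
x <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2, 5, 3, 1),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

result <- distinct_columns(x, tol = 0.2)
print(result)
```

The result depends on column order, since each column is only compared against the ones kept so far; for 12 rows and 300 columns the quadratic cost should be negligible.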