I am trying to assign sub-group membership in 4 independent cancer gene expression datasets, training on each dataset in turn, followed by testing the (metagene based) assignment in the remaining three, plus testing on the training cohort itself.
This produces group memberships for each sample, for each comparison, and gives an idea of sample stability (does a given sample fall within the same cluster each time?). The problem is that the group labels can differ from comparison to comparison, so comparing the labels directly doesn't work.
In order to assess sample stability, I think I will need, for each sample, to catalogue its fellow subgroup members, but I haven't been able to conceptualise precisely how I should do this.
For what it's worth, the code below should demonstrate the problem a little more clearly than I have described above.
Thanks for reading, and any help is appreciated!
## Here we have 12 samples (A-L), all of which have congruent assignments, except sample K.
## From the two group assignments, we can see that group 1 has become group 4 in class2,
## group 2 has become group 1 etc. etc.
## How do we assess cluster membership with these differing subgroup labels?
class1<-c(1,2,3,4,1,2,3,4,1,2,3,4)
class2<-c(4,1,2,3,4,1,2,3,4,1,3,3)
names(class1)<-LETTERS[1:12]
names(class2)<-LETTERS[1:12]
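The idea of cataloguing each sample's fellow subgroup members can be sketched with co-membership matrices, which do not depend on the label values at all. The following is only a minimal base-R illustration on the toy data above; the names `co1`, `co2` and `stability`, and the choice of a Jaccard score, are assumptions made for this sketch and not part of the original question.

```r
class1 <- c(1,2,3,4,1,2,3,4,1,2,3,4)
class2 <- c(4,1,2,3,4,1,2,3,4,1,3,3)
names(class1) <- names(class2) <- LETTERS[1:12]

# TRUE where two samples share a group; the actual label values are irrelevant
co1 <- outer(class1, class1, "==")
co2 <- outer(class2, class2, "==")

# Per-sample Jaccard score: co-members shared by both assignments,
# divided by co-members seen in either assignment (the sample itself included)
stability <- rowSums(co1 & co2) / rowSums(co1 | co2)
names(stability) <- names(class1)

round(stability, 2)
names(which.min(stability))   # the least stable sample
```

A score of 1 means the sample kept exactly the same companions in both assignments. With more than two comparisons, one option would be to average the score over all pairs of assignments.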
Try matchClasses in e1071; some of the methods in the seriation package might also help. You need the full two-way table of classifications, though.
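To make the "full two-way table" concrete, here is a rough base-R sketch that builds the table for the toy data from the question and matches each class1 label to the class2 label it overlaps with most. This simple row-maximum rule is an assumption chosen for illustration; it is not guaranteed to give a one-to-one mapping, and it is not a reproduction of what matchClasses does internally. The names `tab`, `best`, `class1_relabelled` and `unstable` are hypothetical.

```r
class1 <- c(1,2,3,4,1,2,3,4,1,2,3,4)
class2 <- c(4,1,2,3,4,1,2,3,4,1,3,3)
names(class1) <- names(class2) <- LETTERS[1:12]

# Full two-way table: rows are class1 labels, columns are class2 labels
tab <- table(class1, class2)

# For each class1 label, pick the class2 label with the largest overlap
best <- colnames(tab)[apply(tab, 1, which.max)]
names(best) <- rownames(tab)

# Translate class1 into class2's labelling and flag samples that disagree
class1_relabelled <- as.numeric(best[as.character(class1)])
names(class1_relabelled) <- names(class1)
unstable <- names(class1)[class1_relabelled != class2]

tab
best
unstable
```

Once the labels are aligned this way, the two assignments can be compared sample by sample in the usual manner.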
Nice question. Thank you for framing the question so clearly. I am working on clustering myself at the moment, and parked this question for solving later.
Here is a graphical way of solving the problem.
library(ggplot2)
# Create dummy data
# In the first instance, clust2 is a perfect one-to-one relabelling of clust1
# (A -> D, B -> A, C -> B, D -> C)
d <- data.frame(
clust1 = LETTERS[rep(1:4, 3)],
clust2 = LETTERS[rep(c(4,1,2,3), 3)]
)
# after_stat(n) replaces the deprecated ..n.. notation (needs ggplot2 >= 3.3.0)
ggplot(d, aes(x=clust1, y=clust2)) + geom_point(stat="sum", aes(size=after_stat(n)))
# Now modify data so that there is a single instance of imperfect matching
d$clust2[1] <- "A"
ggplot(d, aes(x=clust1, y=clust2)) + geom_point(stat="sum", aes(size=after_stat(n)))