I have several rows of data (tab separated). I want to find the ro开发者_如何学JAVAw which matches elements from two columns (3rd & 4th) in each row with two other columns (10th & 11th). For example, in row 1, 95428891 & 95443771 in column 3 & 4 matches elements in columns 10 & 11 in row 19. Similarly the reciprocal is also true. Elements in columns 3 & 4 in the 19th row also match elements in columns 10 & 11 in row 1. I need to be able to go through each row and output row indices for corresponding matches. It is possible that sometimes only one of the columns match instead of both (because sometimes there are duplicate numbers), but I need to pick only rows where both columns match and also where there is reciprocal match. So it would be a good idea to output row indices where there is reciprocal match, for eg., 1 & 19 as tab separated values (maybe in a different data.frame object). And the rows that do not have reciprocal matches can be output separately. I am trying to implement this in R to run through several rows of data.
1313 chr2 95428891 95443771 14880 chr2:96036782 205673 + chr2 96036782 96052481
1313 chr2 95428896 95443771 14875 chr2:97111880 205214 - chr2 97111880 97127588
1313 chr2 95443771 95526464 82693 chr2:95609272 1748861 - chr2 95609272 95691902
1313 chr2 95477143 95486318 9175 chr2:97616847 177391 + chr2 97616847 97626039
1313 chr2 95486323 95521267 34944 chr2:97035158 268351 + chr2 97035158 97070183
1313 chr2 95515418 95525958 10540 chr2:95563236 132439 + chr2 95563236 95572666
1314 chr2 95563236 95572666 9430 chr2:95515418 132439 + chr2 95515418 95525958
1314 chr2 95563236 95572666 9430 chr2:95609778 126017 - chr2 95609778 95620287
1314 chr2 95563236 95569115 5879 chr2:97064308 89848 + chr2 97064308 97070183
164 chr2 95609272 95691902 82630 chr2:95443771 1748861 - chr2 95443771 95526464
1314 chr2 95609778 95620287 10509 chr2:95563236 126017 - chr2 95563236 95572666
1314 chr2 95614473 95649363 34890 chr2:97035158 394821 - chr2 97035158 97070173
1314 chr2 95649368 95658543 9175 chr2:97616847 177822 - chr2 97616847 97626039
164 chr2 95775062 95814080 39018 chr2:97578938 0 - chr2 97578938 97616780
1315 chr2 95778788 95781856 3068 chr2:97609982 31302 - chr2 97609982 97616788
164 chr2 95780657 95829665 49008 chr2:96053880 882178 - chr2 96053880 96102738
1316 chr2 95829982 95865446 35464 chr2:97296848 242680 - chr2 97296848 97333087
1316 chr2 95829982 95935104 105122 chr2:97438085 1169669 + chr2 97438085 97544431
1317 chr2 96036782 96052481 15699 chr2:95428891 205673 + chr2 95428891 95443771
The accepted answer works, however with large-ish arrays, typically over 50k or so you'll run into memory issues, as the matrices you're creating are huge.
I'd do something like:
match(
interaction( indat$V3, indat$V10),
interaction( indat$V4, indat$V11)
);
Which concatenates all the values of interest into factors and does a match.
Unsure if this is faster, but it's much more memory efficient.
You have not indicated what you would consider a correct answer and your terminology seems a bit vague when you talk about "where there is a reciprocal match", but if I understand the task correctly as finding all rows where col.3 == col.10 & col.4 == col.11, then this should accomplish the task:
which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="),
arr.ind=TRUE)
# result
row col
[1,] 19 1
[2,] 10 3
[3,] 7 6
[4,] 8 6
[5,] 6 7
[6,] 11 8
[7,] 3 10
[8,] 7 11
[9,] 8 11
[10,] 1 19
The outer function applies a function 'FUN', in this case "==", to all two-way combinations of x and y, its first and second arguments, so here we get an n x n matrix with logical entries and I am taking the logical 'and' of two such matrices. So the rows where there are matches with other rows are:
unique( c(which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="),
arr.ind=TRUE) ))
#[1] 19 10 7 8 6 11 3 1
So the set with no matches, assuming a data.frame named indat, is:
matches <- unique( c(which( outer(indat$V4, indat$V11, "==") &
outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
indat[ ! 1:NROW(indat) %in% matches, ]
And the ones with matches are:
indat[ 1:NROW(indat) %in% matches, ]
The below function compare
takes advantage of R´s capability for fast sorting. Function arguments a
and b
are matrices; rows in a
are screend for matching rows in b
for any number of columns. In case column order is irrelevant, set row_order=TRUE
to have the row entries sorted in increasing order. Guess the function should work as well with dataframes and character / factors columns, as well as duplicate entries in a
and/or b
. Despite using the for
& while
it´s relatively quick in returning the first row match in b
for each row of a
(or 0
, if no match is found).
compare<-function(a,b,row_order=TRUE){
len1<-dim(a)[1]
len2<-dim(b)[1]
if(row_order){
a<-t(apply(t(a), 2, sort))
b<-t(apply(t(b), 2, sort))
}
ord1<-do.call(order, as.data.frame(a))
ord2<-do.call(order, as.data.frame(b))
a<-a[ord1,]
b<-b[ord2,]
found<-rep(0,len1)
dims<-dim(a)[2]
do_dims<-c(1:dim(a)[2])
at<-1
for(i in 1:len1){
for(m in do_dims){
while(b[at,m]<a[i,m]){
at<-(at+1)
if(at>len2){break}
}
if(at>len2){break}
if(b[at,m]>a[i,m]){break}
if(m==dims){found[i]<-at}
}
if(at>len2){break}
}
return(found[order(ord1)]) # indicates the first match of a found in b and zero otherwise
}
# example data sets:
a <- matrix(sample.int(1E4,size = 1E4, replace = T), ncol = 4)
b <- matrix(sample.int(1E4,size = 1E4, replace = T), ncol = 4)
b <- rbind(a,b) # example of b containing a
# run the function
found<-compare(a,b,row_order=TRUE)
# check
all(found>0)
# rows in a not contained in b (none in this example):
a[found==0,]
精彩评论