match two columns with two other columns_问答_开发者

I have several rows of data (tab separated). I want to find the ro开发者_如何学JAVAw which matches elements from two columns (3rd & 4th) in each row with two other columns (10th & 11th). For example, in row 1, 95428891 & 95443771 in column 3 & 4 matches elements in columns 10 & 11 in row 19. Similarly the reciprocal is also true. Elements in columns 3 & 4 in the 19th row also match elements in columns 10 & 11 in row 1. I need to be able to go through each row and output row indices for corresponding matches. It is possible that sometimes only one of the columns match instead of both (because sometimes there are duplicate numbers), but I need to pick only rows where both columns match and also where there is reciprocal match. So it would be a good idea to output row indices where there is reciprocal match, for eg., 1 & 19 as tab separated values (maybe in a different data.frame object). And the rows that do not have reciprocal matches can be output separately. I am trying to implement this in R to run through several rows of data.

1313    chr2    95428891    95443771    14880   chr2:96036782   205673  +   chr2    96036782    96052481
1313    chr2    95428896    95443771    14875   chr2:97111880   205214  -   chr2    97111880    97127588
1313    chr2    95443771    95526464    82693   chr2:95609272   1748861 -   chr2    95609272    95691902
1313    chr2    95477143    95486318    9175    chr2:97616847   177391  +   chr2    97616847    97626039
1313    chr2    95486323    95521267    34944   chr2:97035158   268351  +   chr2    97035158    97070183
1313    chr2    95515418    95525958    10540   chr2:95563236   132439  +   chr2    95563236    95572666
1314    chr2    95563236    95572666    9430    chr2:95515418   132439  +   chr2    95515418    95525958
1314    chr2    95563236    95572666    9430    chr2:95609778   126017  -   chr2    95609778    95620287
1314    chr2    95563236    95569115    5879    chr2:97064308   89848   +   chr2    97064308    97070183
164     chr2    95609272    95691902    82630   chr2:95443771   1748861 -   chr2    95443771    95526464
1314    chr2    95609778    95620287    10509   chr2:95563236   126017  -   chr2    95563236    95572666
1314    chr2    95614473    95649363    34890   chr2:97035158   394821  -   chr2    97035158    97070173
1314    chr2    95649368    95658543    9175    chr2:97616847   177822  -   chr2    97616847    97626039
164     chr2    95775062    95814080    39018   chr2:97578938   0       -   chr2    97578938    97616780
1315    chr2    95778788    95781856    3068    chr2:97609982   31302   -   chr2    97609982    97616788
164     chr2    95780657    95829665    49008   chr2:96053880   882178  -   chr2    96053880    96102738
1316    chr2    95829982    95865446    35464   chr2:97296848   242680  -   chr2    97296848    97333087
1316    chr2    95829982    95935104    105122  chr2:97438085   1169669 +   chr2    97438085    97544431
1317    chr2    96036782    96052481    15699   chr2:95428891   205673  +   chr2    95428891    95443771

The accepted answer works, however with large-ish arrays, typically over 50k or so you'll run into memory issues, as the matrices you're creating are huge.

I'd do something like:

match(
  interaction( indat$V3, indat$V10),
  interaction( indat$V4, indat$V11)
);

Which concatenates all the values of interest into factors and does a match.

Unsure if this is faster, but it's much more memory efficient.

You have not indicated what you would consider a correct answer and your terminology seems a bit vague when you talk about "where there is a reciprocal match", but if I understand the task correctly as finding all rows where col.3 == col.10 & col.4 == col.11, then this should accomplish the task:

which( outer(indat$V4, indat$V11, "==") & 
       outer(indat$V3, indat$V10, "=="), 
       arr.ind=TRUE)
# result
      row col
 [1,]  19   1
 [2,]  10   3
 [3,]   7   6
 [4,]   8   6
 [5,]   6   7
 [6,]  11   8
 [7,]   3  10
 [8,]   7  11
 [9,]   8  11
[10,]   1  19

The outer function applies a function 'FUN', in this case "==", to all two-way combinations of x and y, its first and second arguments, so here we get an n x n matrix with logical entries and I am taking the logical 'and' of two such matrices. So the rows where there are matches with other rows are:

unique( c(which( outer(indat$V4, indat$V11, "==") & 
outer(indat$V3, indat$V10, "=="), 
arr.ind=TRUE) ))

#[1] 19 10  7  8  6 11  3  1

So the set with no matches, assuming a data.frame named indat, is:

matches <- unique( c(which( outer(indat$V4, indat$V11, "==") & 
                      outer(indat$V3, indat$V10, "=="), arr.ind=TRUE) ))
indat[ ! 1:NROW(indat) %in% matches, ]

And the ones with matches are:

indat[ 1:NROW(indat) %in% matches, ]

The below function compare takes advantage of R´s capability for fast sorting. Function arguments a and b are matrices; rows in a are screend for matching rows in b for any number of columns. In case column order is irrelevant, set row_order=TRUE to have the row entries sorted in increasing order. Guess the function should work as well with dataframes and character / factors columns, as well as duplicate entries in a and/or b. Despite using the for & while it´s relatively quick in returning the first row match in b for each row of a (or 0, if no match is found).

compare<-function(a,b,row_order=TRUE){

    len1<-dim(a)[1]
    len2<-dim(b)[1]
    if(row_order){
        a<-t(apply(t(a), 2, sort))
        b<-t(apply(t(b), 2, sort))
    }
    ord1<-do.call(order, as.data.frame(a))
    ord2<-do.call(order, as.data.frame(b))
    a<-a[ord1,]
    b<-b[ord2,] 
    found<-rep(0,len1)  
    dims<-dim(a)[2]
    do_dims<-c(1:dim(a)[2])
    at<-1
    for(i in 1:len1){
        for(m in do_dims){
            while(b[at,m]<a[i,m]){
                at<-(at+1)      
                if(at>len2){break}              
            }
            if(at>len2){break}
            if(b[at,m]>a[i,m]){break}
            if(m==dims){found[i]<-at}
        }
        if(at>len2){break}
    }
    return(found[order(ord1)]) # indicates the first match of a found in b and zero otherwise

}


# example data sets:
a <- matrix(sample.int(1E4,size = 1E4, replace = T), ncol = 4)
b <- matrix(sample.int(1E4,size = 1E4, replace = T), ncol = 4)
b <- rbind(a,b) # example of b containing a


# run the function
found<-compare(a,b,row_order=TRUE)
# check
all(found>0) 
# rows in a not contained in b (none in this example):
a[found==0,]