Question

Comparing every row in one matrix with every row of another matrix

After a process has been performed on a matrix (which could be converted to a data frame or some other form if need be), I want to see whether the rows of the original matrix appear anywhere in the new matrix.

E.g. Matrix1 = (1,1,1,1,1;
                2,2,2,2,2;
                3,3,3,4,4;
                5,5,5,6,6)


I would like to know whether (1,1,1,1,1), (2,2,2,2,2) and the others appear in Matrix2. I would also like to know whether very similar rows appear in Matrix2, such as (1,1,1,1,8) or (2,2,2,2,7). Specifically, by 'similar' I mean rows with only 1 or 2 column entries different (for 11 columns, 9 or more column entries the same).
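
For a single pair of rows, the check I have in mind is just counting matching entries against a threshold; a minimal sketch with two made-up 11-column rows (the names row1/row2 are only for illustration):

row1 <- c(1,1,1,1,1,1,1,1,1,1,1)  #a row of the original matrix
row2 <- c(1,1,1,1,1,1,1,1,1,1,8)  #a candidate row of the new matrix
sum(row1 == row2)                 #number of column entries in common (here 10)
sum(row1 == row2) >= 9            #TRUE: at most 2 entries differ, so "similar"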

The main problem I have is processing time: I am only comparing the matrices on 11 columns, but both matrices have around 200,000 rows, so I will be comparing each of 200,000 rows with the other 200,000 rows.
I have a solution with for-loops, but it will take far too long with 200,000*200,000 comparisons...

#Reproducible Example
Firstmatrix <- t(matrix(c(1,1,1,1,1,
                          2,2,2,2,2,
                          3,3,3,3,3,
                          0,0,0,0,0,
                          2,2,2,2,2,
                          4,4,4,4,4,
                          1,2,3,4,5,
                          1,1,1,1,6),
                        nrow=5, ncol=8))

Secondmatrix <- t(matrix(c(1,1,1,1,1,
                           1,1,1,1,1,
                           2,2,2,2,2,
                           3,3,3,3,3,
                           4,4,4,5,5,
                           5,5,5,4,4,
                           6,1,1,1,3,
                           3,1,1,1,6),
                         nrow=5, ncol=8))

#these matrices will be in a similar form to the example above
#To time-test with larger matrices I have used:

Firstmatrix <- Firstmatrix[rep(seq_len(8), each=100), ]
Secondmatrix <- Secondmatrix[rep(seq_len(8), each=100), ]
#which creates matrices with 100x as many rows, and then

t1 <- Sys.time()
#(the code being timed goes here)
t2 <- Sys.time()
t2 - t1
#placed either side of the code to measure how long it takes

#I came up with:

Column.entries.in.common <- matrix(NA, nrow=nrow(Firstmatrix), ncol=1)
Maximum.Column.entries.in.common <- matrix(NA, nrow=nrow(Firstmatrix), ncol=1)

for (i in 1:nrow(Firstmatrix))
{
  for (j in 1:nrow(Secondmatrix))
  {
    #entries row i of Firstmatrix has in common with row j of Secondmatrix
    Column.entries.in.common[j] <- sum(Firstmatrix[i,]==Secondmatrix[j,])
    #best match found so far for row i of Firstmatrix
    Maximum.Column.entries.in.common[i] <- max(Column.entries.in.common)
  }
}


This produces a vector with an entry for every row in 'Firstmatrix': the maximum number of columns that row has in common with any row in 'Secondmatrix'. It works for a few thousand rows in each matrix, but won't be feasible to run for 200k*200k. I know I should probably use mapply, but I wasn't sure how to specify that it should perform comparisons for every row of 'Firstmatrix' against every row of 'Secondmatrix'; previous attempts just compared every element of Firstmatrix to Secondmatrix.
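
For reference, the all-pairs operation I am trying to express would be something like the sketch below; this is only a guess at an outer/Vectorize formulation (not code I have working), and it still performs 200,000*200,000 comparisons, so it does not solve the speed problem:

all.pairs <- outer(seq_len(nrow(Firstmatrix)), seq_len(nrow(Secondmatrix)),
                   Vectorize(function(i, j) sum(Firstmatrix[i, ] == Secondmatrix[j, ])))
apply(all.pairs, 1, max)  #should match Maximum.Column.entries.in.common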

Any help would be greatly appreciated. I know it's a lot of computation, so it will always take a while to run, but anything faster than my current code, which I think would take about 4 months, is a step in the right direction!

Answer

This should be much faster since it only performs non-vectorized iteration over one set of rows, while the rest is vectorized. It relies on the fact that matrices are stored column by column, so At[, i] is recycled down each column of Bt when the == comparison is performed. Another benefit is that extracting columns is potentially faster than extracting rows.
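
A tiny illustration of that recycling, where x and M are just stand-ins for At[, i] and Bt:

x <- c(1, 1, 1, 1, 1)                   #plays the role of At[, i]
M <- matrix(c(1, 1, 1, 1, 1,
              1, 1, 1, 1, 8,
              2, 2, 2, 2, 2), nrow = 5) #plays the role of Bt
x == M           #x is recycled down each column of M (column-major storage)
colSums(x == M)  #per-column match counts, i.e. matches against each original row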

At <- t(Firstmatrix)
Bt <- t(Secondmatrix)
mx <- sapply(1:ncol(At), function(i) max(colSums(At[, i] == Bt)))

all.equal(mx, c(Maximum.Column.entries.in.common))
## [1] TRUE
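
If you then want to flag which rows of Firstmatrix have an exact or near match in Secondmatrix (for your 11-column case, 9 or more entries in common, i.e. at most 2 mismatches), one way would be along these lines:

similar <- mx >= ncol(Firstmatrix) - 2  #at most 2 column entries differ
which(similar)                          #rows of Firstmatrix with a (near-)duplicate in Secondmatrix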

Timing

Here is a timing comparison which shows that for the data given it runs in about 1/60th of the elapsed time.

system.time({
  Column.entries.in.common<-matrix(NA,nrow=nrow(Firstmatrix),ncol=1)
  Maximum.Column.entries.in.common<-matrix(NA,nrow=nrow(Firstmatrix),ncol=1)
  for (i in 1:nrow(Firstmatrix))
  {
    for (j in 1:nrow(Secondmatrix))
    {
    Column.entries.in.common[j]<-sum(Firstmatrix[i,]==Secondmatrix[j,])
    Maximum.Column.entries.in.common[i]<-max(Column.entries.in.common)
    }
  }
})
##   user  system elapsed 
##  10.99    0.00   11.12 


system.time({
  At <- t(Firstmatrix)
  Bt <- t(Secondmatrix)
  mx <- sapply(1:ncol(At), function(i) max(colSums(At[, i] == Bt)))
})
##   user  system elapsed 
##   0.19    0.00    0.19