st.ph.n - 6 months ago 37
Python Question

# Compare two different size matrices to make one large matrix - Speed Improvements?

I have two matrices, that I need to use to create a larger matrix. Each matrix is simply a tab-delimited text file that is read. Each matrix has 48 cols with identical identifiers per matrix, with different numbers of rows. The first matrix is 108887x48, and the second is 55482x48. The entries at each position, for each matrix, can be a 0 or 1, so binary. The final output should have the first matrix row ids as the rows, and the second matrix row ids as the cols, so the final matrix is 55482x10887.

What needs to happen here is that for each pos in each row in the first matrix, for every row in the second matrix, if the pos (col) for each matrix is 1, then the final matrix count will go up 1. The highest value any pos in the final matrix can be is 48, and there is expected to be 0's left over.

Example:

``````mat1
A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0

mat2
A B C D
2id1 1 1 0 0
2id2 0 1 1 0
2id3 1 1 1 1
2id4 1 0 1 0

final
2id1 2id2 2id3 2id4
1id1   1    1    2    0
1id2   2    1    2    1
1id3   2    2    4    2
1id4   0    1    1    1
``````

I have code to do this, however it is painfully slow, which is where I'm mostly asking for help. I've tried to speed up the algorithm as best as possible. It's been running for 24hrs, and is only about 25% of the way through. I have let it run through before and the final output file is 20GB. I'm not experienced with databases, and could implement it here, if osomeone could assist me on how to do so given a snippet of the code below.

``````#!/usr/bin/env python

import sys

mat1in = sys.argv[1]
mat2in = sys.argv[2]

print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'

with open(mat1in, 'r') as f:
cols = [''] + next(f).strip().split('\t')               # First line of matrix is composed of 48 cols
mat1 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

with open(mat2in, 'r') as f:
next(f)                                                 # Skip first row, col IDs are taken from mat1
mat2 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

out = open('final_matrix.txt', 'w')                             # Output file

#matrix = []
for i in mat2:
header.append(i[0])                                     # Composed of all mat2 row ids

print '\nTotal mat1 rows: ' + str(len(mat1))                    # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n'                # Get total mat2 rows
print 'Progress: '                                              # Progress updated as each mat1 id is read

length = len(header)                                            # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1)                                             # Length of rows (-header), i.e. total number of mat1 ids

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
row = []                                                # Empty list for new row for large matrix
row.append(h[0])                                        # Append mat1 id, as first item in each row
for i in xrange(length-1):                              # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
row.extend('0')
for n in xrange(1,49):                                  # Loop through each col id
for k in mat2:                                  # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = header.index(k[0])        # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1                                              # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(tvh))  # Print progress to screen
sys.stdout.flush()

print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'
``````

Here's what profiling the first 30 iterations for the ids in mat1:

``````######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################

Total mat1 rows: 108887
Total mat2 rows: 55482

Progress:
30/108887^C         2140074 function calls in 101.234 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1   70.176   70.176  101.234  101.234 build_matrix.py:3(<module>)
4    0.000    0.000    0.000    0.000 {len}
55514    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
1719942    1.056    0.000    1.056    0.000 {method 'extend' of 'list' objects}
30    0.000    0.000    0.000    0.000 {method 'flush' of 'file' objects}
35776   29.332    0.001   29.332    0.001 {method 'index' of 'list' objects}
31    0.037    0.001    0.037    0.001 {method 'join' of 'str' objects}
164370    0.589    0.000    0.589    0.000 {method 'split' of 'str' objects}
164370    0.033    0.000    0.033    0.000 {method 'strip' of 'str' objects}
30    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
2    0.000    0.000    0.000    0.000 {next}
3    0.004    0.001    0.004    0.001 {open}
``````

I've also timed each iteration, which takes about 2.5-3s per mat1 id, and if I'm correct would take about 90hrs to complete the whole thing. This is about what it took to run the entire script all the way through.

I've edited some of the main bits, to remove making the rows by append and xrange to making the list in one step by multipling '0' by the lengthof the headers. Also I made a dictionary of the mat2 ids with the index as values to, thinking dict lookup would be quicker than index..

``````headdict = {}

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
timestart = time.clock()
row = [h[0]] + ['0']*(length-1)                 # Empty list for new row for large matrix
#row.append(h[0])                                       # Append mat1 id, as first item in each row
#for i in xrange(length-1):                             # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
#       row.append('0')
for n in xrange(1,49):                                  # Loop through each col id
for k in mat2:                                  # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = headdict[k[0]] #header.index(k[0])        # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1                                              # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1))  # Print progress to screen
#sys.stdout.flush()
timeend = time.clock()
print timestart - timeend
``````

If you read your two input matrices into NumPy arrays of dtype `numpy.int8`, then the computation is simply
``````m1.dot(m2.T)