chris chris - 2 months ago 14
Python Question

How to multiply every column of one Pandas Dataframe with every column of another Dataframe efficiently?

I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.

The dataframes are one-hot encoded, so they look like this:

col_1, col_2, col_3, ...
0 1 0
1 0 0
0 0 1
...


I could just iterate through each of the columns using a for loop, but in python that is computationally expensive, and I'm hoping there's an easier way.

One of the dataframes has 500 columns, the other has 100 columns.

This is the fastest version that I've been able to write so far:

interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)


I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?

EDIT: example

df_1:

1col_1, 1col_2, 1col_3
0 1 0
1 0 0
0 0 1


df_2:

2col_1, 2col_2
0 1
1 0
0 0


interact_pd:

1col_1_2col_1, 1col_2_2col_1,1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2,1col_3_2col_2

0 0 0 0 1 0
1 0 0 0 0 0
0 0 0 0 0 0

Answer
# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                  names=[df_1.columns.name, df_2.columns.name])

# df_1.values[:, pidx[0]] slices df_1 values for every combination
# like wise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
             columns=lcol)

enter image description here


Timing

code

from string import ascii_letters

df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))

def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)

    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])

    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)

def Test2(DA,DB):
  MA = DA.as_matrix()
  MB = DB.as_matrix()
  MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
  Col = []
  for i in range(len(MB[0])):
    for j in range(len(MA[0])):
      MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
      Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
  return pd.DataFrame(MM,dtype=int,columns=Col)

results

enter image description here