Garrett Miller Garrett Miller - 4 months ago 59
Python Question

Comparing rows of two pandas dataframes?

This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?

I have two dataframes

A
and
B
:

A
is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.

For a condensed example:

| A | B | C | D | E |
0 | 0 0 0 1 0
1 | 1 1 1 1 0
2 | 1 0 0 1 1
3 | 0 1 1 1 0


B
is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.

Example:

| 0 | 1 | 2 |
0 | 0 0 0
1 | 0 0 1
2 | 0 1 0
3 | 0 1 1
4 | 1 0 0
5 | 1 0 1
6 | 1 1 0
7 | 1 1 1


I am trying to find which rows in
A
, at a particular 10 columns of
A
, correspond with each row of
B
.

Each row of
A[My_Columns_List]
is guaranteed to be somewhere in
B
, but not every row of
B
will match up with a row in
A[My_Columns_List]


For example, I want to show that for columns
[B,D,E]
of
A
,

rows
[1,3]
of A match up with row
[6]
of
B
,

row
[0]
of A matches up with row
[2]
of
B
,

row
[2]
of A matches up with row
[3]
of
B
.

I have tried using:

pd.merge(B.reset_index(), A.reset_index(),
left_on = B.columns.tolist(),
right_on =A.columns[My_Columns_List].tolist(),
suffixes = ('_B','_A')))


This works, but I was hoping that this method would be faster:

S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]


But when I do this,
out_row_idx
returns an array containing all the indices of
A
, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!

Also, credit goes to @jezrael and @Divakar for these methods.

Answer

I'll stick by my initial answer but maybe explain better.

You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.

Setup

You said we have a a 1000 x 500 array of ones and zeros. Let's build that.

A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1]/10), range(10)])
A = A_init

In addition, I gave A a MultiIndex to easily group by columns of 10.

Solution

This is very similar to @Divakar's answer with one minor difference that I'll point out.

For one group of 10 ones and zeros, we can treat it as a bit array of length 8. We can then calculate what it's integer value is by taking the dot product with an array of powers of 2.

twos = 2 ** np.arange(10)

I can execute this for every group of 10 ones and zeros in one go like this

AtB = A.stack(0).dot(twos).unstack()

I stack to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack.

I now have a 1000 x 50 dataframe of numbers that range from 0-1023.

Assume B is a dataframe with each row one of 1024 unique combinations of ones and zeros. B should be sorted like B = B.sort_values().reset_index(drop=True).

This is the part I think I failed at explaining last time. Look at

AtB.loc[:2, :2]

enter image description here

That value in the (0, 0) position, 951 means that the first group of 10 ones and zeros in the first row of A matches the row in B with the index 951. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring B. Ignoring this useless step should save time.

I'm working on an even faster way but this is a function that takes a to dataframes A and B and returns a dataframe of indices where A matches B.

def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])

    twos = 2 ** np.arange(10)

    return A.stack(0).dot(twos).unstack()

def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    rng = np.arange(A.shape[1])
    A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()

I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)

def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    num_groups = A.shape[1] / 10
    a = A.values.reshape(-1, num_groups)
    a = np.einsum('ij->i', A.stack(0) << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)

Timing

enter image description here