Garrett Miller - 8 months ago 91

Python Question

This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?

I have two dataframes

`A`

`B`

`A`

For a condensed example:

`| A | B | C | D | E |`

0 | 0 0 0 1 0

1 | 1 1 1 1 0

2 | 1 0 0 1 1

3 | 0 1 1 1 0

`B`

Example:

`| 0 | 1 | 2 |`

0 | 0 0 0

1 | 0 0 1

2 | 0 1 0

3 | 0 1 1

4 | 1 0 0

5 | 1 0 1

6 | 1 1 0

7 | 1 1 1

I am trying to find which rows in

`A`

`A`

`B`

Each row of

`A[My_Columns_List]`

`B`

`B`

`A[My_Columns_List]`

For example, I want to show that for columns

`[B,D,E]`

`A`

rows

`[1,3]`

`[6]`

`B`

row

`[0]`

`[2]`

`B`

row

`[2]`

`[3]`

`B`

I have tried using:

`pd.merge(B.reset_index(), A.reset_index(),`

left_on = B.columns.tolist(),

right_on =A.columns[My_Columns_List].tolist(),

suffixes = ('_B','_A')))

This works, but I was hoping that this method would be faster:

`S = 2**np.arange(10)`

A_ID = np.dot(A[My_Columns_List],S)

B_ID = np.dot(B,S)

out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]

But when I do this,

`out_row_idx`

`A`

I think this method will be faster, but I don't know why it returns an array from 0 to 999.

Any input would be appreciated!

Also, credit goes to @jezrael and @Divakar for these methods.

Answer

I'll stick by my initial answer but maybe explain better.

You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.

You said we have a a 1000 x 500 array of ones and zeros. Let's build that.

```
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1]/10), range(10)])
A = A_init
```

In addition, I gave `A`

a `MultiIndex`

to easily group by columns of 10.

This is very similar to @Divakar's answer with one minor difference that I'll point out.

For one group of 10 ones and zeros, we can treat it as a bit array of length 8. We can then calculate what it's integer value is by taking the dot product with an array of powers of 2.

```
twos = 2 ** np.arange(10)
```

I can execute this for every group of 10 ones and zeros in one go like this

```
AtB = A.stack(0).dot(twos).unstack()
```

I `stack`

to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the `unstack`

.

I now have a 1000 x 50 dataframe of numbers that range from 0-1023.

Assume `B`

is a dataframe with each row one of 1024 unique combinations of ones and zeros. `B`

should be sorted like `B = B.sort_values().reset_index(drop=True)`

.

This is the part I think I failed at explaining last time. Look at

```
AtB.loc[:2, :2]
```

That value in the `(0, 0)`

position, `951`

means that the first group of 10 ones and zeros in the first row of `A`

matches the row in `B`

with the index `951`

. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring `B`

. Ignoring this useless step should save time.

I'm working on an even faster way but this is a function that takes a to dataframes `A`

and `B`

and returns a dataframe of indices where `A`

matches `B`

.

```
def FindAinB(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
twos = 2 ** np.arange(10)
return A.stack(0).dot(twos).unstack()
```

```
def FindAinB2(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
# use clever bit shifting instead of dot product with powers
# questionable improvement
return (A.stack(0) << np.arange(10)).sum(1).unstack()
```

I'm channelling my inner @Divakar (read, this is stuff I've learned from Divakar)

```
def FindAinB3(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
num_groups = A.shape[1] / 10
a = A.values.reshape(-1, num_groups)
a = np.einsum('ij->i', A.stack(0) << np.arange(10))
return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
```