zzzeek - 8 months ago 24

Python Question

Given two dataframes as below:

`>>> import pandas as pd`

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])

>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

>>> df_a

a b

0 1 4

1 2 5

2 3 6

>>> df_b

c d

0 2 7

1 3 8

we would like to produce a SQL-style join of both dataframes using a non-simplistic criteria, let's say "df_b.c > df_a.a". From what I can tell, while

`merge()`

In SQL, the results look like this:

`# inner join`

sqlite> select * from df_a join df_b on c > a;

1|4|2|7

1|4|3|8

2|5|3|8

# outer join

sqlite> select * from df_a left outer join df_b on c > a;

1|4|2|7

1|4|3|8

2|5|3|8

3|6||

my current approach for inner join is to produce a cartesian product

of df_a and df_b, by adding a column of "1"s to both, then using

merge() on the "1"s column, then applying the "c > a" criteria.

`>>> import numpy as np`

>>> df_a['ones'] = np.ones(3)

>>> df_b['ones'] = np.ones(2)

>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')

>>> cartesian

a b ones c d

0 1 4 1 2 7

1 1 4 1 3 8

2 2 5 1 2 7

3 2 5 1 3 8

4 3 6 1 2 7

5 3 6 1 3 8

>>> cartesian[cartesian.c > cartesian.a]

a b ones c d

0 1 4 1 2 7

1 1 4 1 3 8

3 2 5 1 3 8

for outer join, I'm not sure of the best way to go, so far

I've been playing with getting the inner join, then applying the negation

of the criteria to get all the other rows, then trying to edit that

"negation" set onto the original, but it doesn't really work.

`df_a['_left_index'] = df_a.index`

then we do the cartesian and get the inner join:

`cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')`

innerjoin = cartesian[cartesian.c > cartesian.a]

then I get the additional index ids in "df_a" that we'll need, and get the rows from "df_a":

`remaining_left_ids = set(df_a['_left_index']).\`

difference(innerjoin['_left_index'])

remaining = df_a.ix[remaining_left_ids]

then we use a straight concat(), which replaces missing columns with "NaN" for left (I thought it wasn't doing this earlier but I guess it does):

`outerjoin = pd.concat([innerjoin, remaining]).reset_index()`

HYRY's idea to do the cartesian on just those cols that we need to compare on is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).

questions:

- How would you produce a "join" of df_1 and df_2 on "c > a"? Would

you do the same "cartesian product, filter" approach or is there some better

way? - How would you produce the "left outer join" of same?

Answer

I use the outer method of ufunc to calculate the result, here is the example:

First, some data:

```
import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print "df_a"
print df_a
print "df_b"
print df_b
```

output:

```
df_a
a b
0 1 4
1 2 5
2 3 6
3 4 8
4 1 7
df_b
c d
0 2 7
1 3 8
2 2 10
```

Inner join, because this only calculate the cartesian product of `c`

& `a`

, memory useage is less than cartesian product of the whole DataFrame:

```
ia, ib = np.where(np.less.outer(df_a.a, df_b.c))
print pd.concat((df_a.take(ia).reset_index(drop=True),
df_b.take(ib).reset_index(drop=True)), axis=1)
```

output:

```
a b c d
0 1 4 2 7
1 1 4 3 8
2 1 4 2 10
3 2 5 3 8
4 1 7 2 7
5 1 7 3 8
6 1 7 2 10
```

to calculate the left outer join, use `numpy.setdiff1d()`

to find all the rows of `df_a`

that not in the inner join:

```
na = np.setdiff1d(np.arange(len(df_a)), ia)
nb = -1 * np.ones_like(na)
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
print pd.concat([df_a.take(oa).reset_index(drop=True),
df_b.take(ob).reset_index(drop=True)], axis=1)
```

output:

```
a b c d
0 1 4 2 7
1 1 4 3 8
2 1 4 2 10
3 2 5 3 8
4 1 7 2 7
5 1 7 3 8
6 1 7 2 10
7 3 6 NaN NaN
8 4 8 NaN NaN
```