zzzeek zzzeek - 3 months ago 9
Python Question

how to perform an inner or outer join of DataFrames with Pandas on non-simplistic criterion

Given two dataframes as below:

>>> import pandas as pd

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
a b
0 1 4
1 2 5
2 3 6

>>> df_b
c d
0 2 7
1 3 8


we would like to produce a SQL-style join of both dataframes using a non-simplistic criteria, let's say "df_b.c > df_a.a". From what I can tell, while
merge()
is certainly part of the solution, I can't use it directly since it doesn't accept arbitrary expressions for "ON" criteria (unless I'm missing something?).

In SQL, the results look like this:

# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8

# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||


my current approach for inner join is to produce a cartesian product
of df_a and df_b, by adding a column of "1"s to both, then using
merge() on the "1"s column, then applying the "c > a" criteria.

>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
a b ones c d
0 1 4 1 2 7
1 1 4 1 3 8
2 2 5 1 2 7
3 2 5 1 3 8
4 3 6 1 2 7
5 3 6 1 3 8
>>> cartesian[cartesian.c > cartesian.a]
a b ones c d
0 1 4 1 2 7
1 1 4 1 3 8
3 2 5 1 3 8


for outer join, I'm not sure of the best way to go, so far
I've been playing with getting the inner join, then applying the negation
of the criteria to get all the other rows, then trying to edit that
"negation" set onto the original, but it doesn't really work.

Edit. HYRY answered the specific question here but I needed something more generic and more within the Pandas API, as my join criterion could be anything, not just that one comparison. For outerjoin, first I'm adding an extra index to the "left" side that will maintain itself after I do the inner join:

df_a['_left_index'] = df_a.index


then we do the cartesian and get the inner join:

cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]


then I get the additional index ids in "df_a" that we'll need, and get the rows from "df_a":

remaining_left_ids = set(df_a['_left_index']).\
difference(innerjoin['_left_index'])
remaining = df_a.ix[remaining_left_ids]


then we use a straight concat(), which replaces missing columns with "NaN" for left (I thought it wasn't doing this earlier but I guess it does):

outerjoin = pd.concat([innerjoin, remaining]).reset_index()


HYRY's idea to do the cartesian on just those cols that we need to compare on is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).

questions:


  1. How would you produce a "join" of df_1 and df_2 on "c > a"? Would
    you do the same "cartesian product, filter" approach or is there some better
    way?

  2. How would you produce the "left outer join" of same?


Answer

I use the outer method of ufunc to calculate the result, here is the example:

First, some data:

import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print "df_a"
print df_a
print "df_b"
print df_b

output:

df_a
   a  b
0  1  4
1  2  5
2  3  6
3  4  8
4  1  7
df_b
   c   d
0  2   7
1  3   8
2  2  10

Inner join, because this only calculate the cartesian product of c & a, memory useage is less than cartesian product of the whole DataFrame:

ia, ib = np.where(np.less.outer(df_a.a, df_b.c))
print pd.concat((df_a.take(ia).reset_index(drop=True), 
                 df_b.take(ib).reset_index(drop=True)), axis=1)

output:

   a  b  c   d
0  1  4  2   7
1  1  4  3   8
2  1  4  2  10
3  2  5  3   8
4  1  7  2   7
5  1  7  3   8
6  1  7  2  10

to calculate the left outer join, use numpy.setdiff1d() to find all the rows of df_a that not in the inner join:

na = np.setdiff1d(np.arange(len(df_a)), ia)
nb = -1 * np.ones_like(na)
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
print pd.concat([df_a.take(oa).reset_index(drop=True), 
                 df_b.take(ob).reset_index(drop=True)], axis=1)

output:

   a  b   c   d
0  1  4   2   7
1  1  4   3   8
2  1  4   2  10
3  2  5   3   8
4  1  7   2   7
5  1  7   3   8
6  1  7   2  10
7  3  6 NaN NaN
8  4  8 NaN NaN
Comments