Danny David Leybzon Danny David Leybzon - 2 months ago 37
Python Question

Conditionally extracting Pandas rows based on another Pandas dataframe

I have two dataframes:

df1:


col1 col2
1 2
1 3
2 4


df2:


col1
2
3


I want to extract all the rows in
df1
where
df1
's
col2
not in
df2
's
col1
. So in this case it would be:

col1 col2
2 4


I first tried:

df1[df1['col2'] not in df2['col1']]


But it returned:


TypeError: 'Series' objects are mutable, thus they cannot be hashed


I then tried:

df1[df1['col2'] not in df2['col1'].tolist]


But it returned:


TypeError: argument of type 'instancemethod' is not iterable

Answer

You can use isin with ~ for inverting boolean mask:

print (df1['col2'].isin(df2['col1']))
0     True
1     True
2    False
Name: col2, dtype: bool

print (~df1['col2'].isin(df2['col1']))
0    False
1    False
2     True
Name: col2, dtype: bool

print (df1[~df1['col2'].isin(df2['col1'])])
   col1  col2
2     2     4

Timings:

In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop

In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 ┬Ás per loop