Robert Smith Robert Smith - 1 year ago 287
Python Question

set difference for pandas

A simple pandas question:

Is there a

functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
col1 col2
0 1 2
1 2 3
2 3 4

In [8]: df2
col1 col2
0 4 6
1 2 3
2 5 5

so maybe something like df2.set_diff(df1) will produce this:

col1 col2
0 4 6
2 5 5

However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.


Answer Source
from pandas import  DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

print df2[~df2.isin(df1).all(1)]
print df2[(df2!=df1)].dropna(how='all')
print df2[~(df2==df1)].dropna(how='all')