piRSquared piRSquared - 3 months ago 12
Python Question

pandas drop rows with duplicates in some columns relative to other columns

consider the dataframe

df


data = [
['a', 'b', 'c', 'd'],
['a', 'b', 'e', 'f'],
['g', 'h', 'i', 'j'],
['k', 'l', 'm', 'n'],
['m', 'n', 'o', 'p'],
['q', 'r', 's', 't'],
['u', 'v', 'w', 'x'],
['y', 'z', 'q', 'r'],
]

cols = pd.MultiIndex.from_product([list('AB'), list('XY')])

df = pd.DataFrame(data, columns=cols)

df


enter image description here

I want to compare all rows of
df.A
with all rows of
df.B
and drop the rows that have matches. I do not want to drop rows that have mathces within
df.A
or
df.B
by themselves.

Visual

enter image description here


  • leave rows
    [0, 1]
    alone.

  • drop rows
    [3, 4]
    because
    'm', 'n'
    match

  • drop rows
    [5, 7]
    because
    'q', 'r'
    match



Results should look like

df.loc[[0, 1, 2, 6]]


enter image description here




I've tried to
stack
and
drop_duplicates


df.stack(0).drop_duplicates(keep=False) \
.unstack().dropna() \
.swaplevel(0, 1, 1).sort_index(1)


enter image description here

But that catches duplicates in the same sub-columns, which isn't what I want.

Answer

try this:

In [248]: df[~(df.A.isin(df.B.to_dict('list')).all(1) | df.B.isin(df.A.to_dict('list')).all(1))]
Out[248]:
   A     B
   X  Y  X  Y
0  a  b  c  d
1  a  b  e  f
2  g  h  i  j
6  u  v  w  x

Explanation:

In [249]: df.A.isin(df.B.to_dict('list'))
Out[249]:
       X      Y
0  False  False
1  False  False
2  False  False
3  False  False
4   True   True
5   True   True
6  False  False
7  False  False

In [250]: df.A.isin(df.B.to_dict('list')).all(1)
Out[250]:
0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
dtype: bool

In [251]: df.B.isin(df.A.to_dict('list'))
Out[251]:
       X      Y
0  False  False
1  False  False
2  False  False
3   True   True
4  False  False
5  False  False
6  False  False
7   True   True

In [252]: df.B.isin(df.A.to_dict('list')).all(1)
Out[252]:
0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
dtype: bool

combining both:

In [253]: df.A.isin(df.B.to_dict('list')).all(1) | df.B.isin(df.A.to_dict('list')).all(1)
Out[253]:
0    False
1    False
2    False
3     True
4     True
5     True
6    False
7     True
dtype: bool