lostsoul29 lostsoul29 - 7 months ago 24
Python Question

Pandas.Dataframe.duplicated() includes missing rows as duplicates

I have a Pandas Dataframe named

Merged
that has an attribute named
RegimentalNumber
.

I'm using the
Pandas.Dataframe.duplicated()
method to retrieve the duplicates from this dataframe like so:

In [16]: Merged[Merged.RegimentalNumber.duplicated() == True]


However, it looks like the result also includes missing values of
RegimentalNumber
as duplicates.

Does the
duplicated()
method take a flag or parameter to exclude missing values as duplicates? I took a look at the API Documentation for this method but could not find such a flag.

Of course I can then simply exclude the missing values like this:

In [17]: duplicates = Merged[Merged.RegimentalNumber.duplicated() == True]
In [18]: duplicates[duplicates.RegimentalNumber.notnull()]


However, it doesn't seem right to me that the
duplicated()
method also includes missing values as duplicates. Is there a simpler, one step solution?

Answer

You could use df.dropna() to ensure NULL values are ignored. For example,

import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': [1, np.nan, 1, 2, 3, 2, 3, np.nan, float('nan'), 
                           np.nan, float('nan'), 'xyz']})

print(df.dropna().loc[df['foo'].duplicated()])

yields

  foo
2   1
5   2
6   3

Note that df['foo'].duplicated() is a boolean Series whose index may be larger than df.dropna().index. However, when you select rows using df.dropna().loc, the boolean Series index is reindexed to match df.dropna().index, so the NULL values conveniently get dropped.