lostsoul29 lostsoul29 - 8 months ago 34
Python Question

Pandas.Dataframe.duplicated() includes missing rows as duplicates

I have a Pandas Dataframe named

that has an attribute named

I'm using the
method to retrieve the duplicates from this dataframe like so:

In [16]: Merged[Merged.RegimentalNumber.duplicated() == True]

However, it looks like the result also includes missing values of
as duplicates.

Does the
method take a flag or parameter to exclude missing values as duplicates? I took a look at the API Documentation for this method but could not find such a flag.

Of course I can then simply exclude the missing values like this:

In [17]: duplicates = Merged[Merged.RegimentalNumber.duplicated() == True]
In [18]: duplicates[duplicates.RegimentalNumber.notnull()]

However, it doesn't seem right to me that the
method also includes missing values as duplicates. Is there a simpler, one step solution?


You could use df.dropna() to ensure NULL values are ignored. For example,

import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': [1, np.nan, 1, 2, 3, 2, 3, np.nan, float('nan'), 
                           np.nan, float('nan'), 'xyz']})



2   1
5   2
6   3

Note that df['foo'].duplicated() is a boolean Series whose index may be larger than df.dropna().index. However, when you select rows using df.dropna().loc, the boolean Series index is reindexed to match df.dropna().index, so the NULL values conveniently get dropped.