J Jones - 6 months ago
Python Question

Pandas DataFrame logic operations with NaN

I'm trying to do some comparisons in a pandas DataFrame.

import numpy as np
from pandas import DataFrame

# create simple DataFrame
df = DataFrame(['one', 'two', 'three'], range(1, 4), columns=['col1'])
# assign one col1 value to be NaN
df.loc[1, 'col1'] = np.nan
# this comparison works
print(df['col1'] == 'three')
# assign all col1 values to NaN
df.loc[:, 'col1'] = np.nan
# this comparison fails
print(df['col1'] == 'three')


The first comparison (with only one NaN value in the column) works as expected, but the second (with all NaN values in the column) produces this error:
TypeError: invalid type comparison


What's going on here?

I saw this question, which suggests some possible but kind of hack-y solutions to this problem.

But why does this behavior happen in the first place? Is this restriction useful, somehow? I can fix it by using
df.fillna('')
before my comparisons, but this seems clunky and irritating.

So my questions are:

1. What is the cleanest way around this issue?

2. Why is this the default behavior, anyway?

Answer

Your col1 has dtype float64 once every value is np.nan, so comparing it to a string raises a TypeError:

import numpy as np
import pandas as pd

df = pd.DataFrame(['one', 'two', 'three'], range(1, 4), columns=['col1'])
df.loc[1, 'col1'] = np.nan

    col1
1    NaN
2    two
3  three

Assigning a single np.nan to a column that contains string values leaves dtype object:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 1 to 3
Data columns (total 1 columns):
col1    2 non-null object
dtypes: object(1)

But when every value is np.nan, the column is converted to float64:

df.loc[:, 'col1'] = np.nan
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 1 to 3
Data columns (total 1 columns):
col1    0 non-null float64
dtypes: float64(1)
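
As for a cleaner workaround than fillna(''), one possibility is to cast the column to object just for the comparison. The sketch below is only an illustration: the names mixed, all_nan and mask are made up for this example, and the Series are built directly rather than through df.loc so the dtypes do not depend on how assignment behaves in a particular pandas version. Whether the original TypeError still appears at all may also depend on the pandas version installed.

```python
import numpy as np
import pandas as pd

# A column mixing NaN with strings is stored with dtype object.
mixed = pd.Series([np.nan, 'two', 'three'])
# A column holding only NaN is stored as float64.
all_nan = pd.Series([np.nan, np.nan, np.nan])

print(mixed.dtype)    # object
print(all_nan.dtype)  # float64

# Casting to object before comparing makes the comparison run on plain
# Python objects, so each NaN simply compares unequal to the string.
mask = all_nan.astype(object) == 'three'
print(mask.tolist())  # [False, False, False]
```

The cast leaves the original column untouched, so nothing has to be filled in or restored afterwards.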