J Jones - 1 year ago
Python Question

Pandas DataFrame logic operations with NaN

I'm trying to do some comparisons in a pandas DataFrame.

import numpy as np
import pandas as pd

# create simple DataFrame
df = pd.DataFrame(['one', 'two', 'three'], range(1, 4), columns=['col1'])
# assign one col1 value to be NaN
df.loc[1, 'col1'] = np.nan
# this comparison works
print(df['col1'] == 'three')
# assign all col1 values to NaN
df.loc[:, 'col1'] = np.nan
# this comparison fails
print(df['col1'] == 'three')

The first comparison (with only one NaN value in the column) works as expected, but the second (with every value in the column set to NaN) produces this error:
TypeError: invalid type comparison

What's going on here?

I saw this question, which suggests some possible but kind of hack-y solutions to this problem.

But why is this behavior happening in the first place? Is this restriction useful, somehow? I can fix it by using
before my comparisons, but this seems clunky and irritating.

So my questions are:

1. What is the cleanest way around this issue?

2. Why is this the default behavior, anyway?


Your col1 column has dtype float64 once every value has been set to np.nan, so comparing it to a string raises a TypeError. Starting from your example:

df = pd.DataFrame(['one', 'two', 'three'], range(1, 4), columns=['col1'])
df.loc[1, 'col1'] = np.nan
print(df)

    col1
1    NaN
2    two
3  three

Assigning a single np.nan to a column that contains string values leaves the dtype as object:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 1 to 3
Data columns (total 1 columns):
col1    2 non-null object
dtypes: object(1)
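
The same check can be made without reading the info() summary. This is only a minimal sketch: the names mixed_dtype and mixed_result are made up for illustration, and dtype=object is passed explicitly (an assumption, not part of the original example) so that the column is object regardless of how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# dtype=object is an assumption of this sketch; it pins the column to
# object dtype independently of the installed pandas version.
df = pd.DataFrame(['one', 'two', 'three'], index=range(1, 4),
                  columns=['col1'], dtype=object)
df.loc[1, 'col1'] = np.nan

mixed_dtype = df['col1'].dtype        # the column is still object
mixed_result = df['col1'] == 'three'  # element-wise; NaN compares unequal

print(mixed_dtype)
print(mixed_result.tolist())
```

Because the column still holds Python objects, the comparison is done element by element and the NaN entry simply evaluates to False.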

But assigning np.nan to every value converts the column to float64:

df.loc[:, 'col1'] = np.nan
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 1 to 3
Data columns (total 1 columns):
col1    0 non-null float64
dtypes: float64(1)
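
As for a way around it, one possible approach (a sketch, not necessarily the cleanest) is to cast the column back to object before comparing. The example below assumes that cast is acceptable for your data. It replaces the column with df['col1'] = np.nan rather than the .loc assignment above, because plain column replacement gives float64 reliably, whereas the result of the .loc form may depend on the pandas version. It also does not rely on the TypeError being raised, since some pandas versions may return all False for the float-versus-string comparison instead.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, which is why a column holding
# nothing but NaN ends up with a float dtype.
nan_is_float = isinstance(np.nan, float)

df = pd.DataFrame(['one', 'two', 'three'], index=range(1, 4), columns=['col1'])

# Replacing the whole column with a scalar NaN yields a float64 column.
df['col1'] = np.nan
all_nan_dtype = df['col1'].dtype

# Casting to object first makes the comparison element-wise on Python
# objects, so each NaN just compares unequal to the string.
result = df['col1'].astype(object) == 'three'

print(nan_is_float)
print(all_nan_dtype)
print(result.tolist())
```

The cast costs a copy of the column, so for large frames it may be preferable to check df['col1'].dtype once and skip the string comparison entirely when the column is not object.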