albert albert - 4 months ago 26
Python Question

pandas: check whether an element is in dataframe or given column leads to strange results

I am doing some data handling based on a DataFrame with the shape (135150, 12), so double-checking my results manually is no longer feasible.

I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.

This behavior is reproducible with even smaller dataframes as follows:

import numpy as np
import pandas as pd

start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)

val = 0.0019

df = pd.DataFrame(arr, columns=['example_value'])

print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`


Since I am very much a beginner in data analysis, I am not able to explain this behavior.

I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) in order to check if the given value exists in the dataframe. Additionally, when using np.array or np.ndarray, machine accuracy comes into play, which I am aware of.

However, in the end I need to implement several functions to filter the dataframe and count the occurrences of some values. I have done this successfully several times before, using boolean columns in combination with operations like > and <.

But in this case I need to filter by the exact value and count its occurrences, which ultimately led me to the issue described above.

So could anyone explain what's going on here?

Answer

The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods, so you can just do something like:

df['example_value'].apply(lambda x: np.isclose(x, val)).any()

or

np.isclose(df['example_value'], val).any()

both of which correctly return True.
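
Since the question also asks about filtering by the value and counting its occurrences, here is a minimal sketch of how that could look with the same np.isclose approach. The tolerance (atol=1e-7 with rtol=0) is my assumption, chosen to be well below the step of 1e-5; pick whatever fits your data. The names mask, matches and count are just illustrative. The first two checks are a reminder that `in` on a DataFrame tests the column labels and on a Series tests the index labels, so neither of those looks at the stored values at all.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
start, end, step = 1e-3, 2e-3, 0.01e-3
arr = np.arange(start, end + step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])

# `in` checks labels, not values
print('example_value' in df)     # membership among the column labels
print(0 in df['example_value'])  # membership among the index labels

# Boolean mask of rows whose value lies within the tolerance of val.
# atol=1e-7 and rtol=0 are assumptions, not values from the question.
mask = np.isclose(df['example_value'].to_numpy(), val, rtol=0, atol=1e-7)

matches = df[mask]       # filtered rows
count = int(mask.sum())  # number of occurrences
print(count)
print(matches)
```

If the values come from a fixed grid like this one, another option would be to round the column and the search value to the same number of decimals (for example with Series.round) before comparing, but the tolerance-based mask avoids having to decide on a rounding precision.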
