Rich Thompson - 3 months ago
Python Question

Pandas: when can I directly assign to values array

I was messing with pandas in the interpreter and the following behavior took me by surprise:

>>> import numpy as np
>>> import pandas as pd
>>> data2 = [[1, np.nan], [2, -17]]
>>> f2 = pd.DataFrame(data2)
>>> f2
   0     1
0  1   NaN
1  2 -17.0
>>> f2.values[1, 1] = -99.0
>>> f2
   0     1
0  1   NaN
1  2 -17.0
>>> type(f2.values[0, 0]), type(f2.values[1, 0])
(<class 'numpy.float64'>, <class 'numpy.float64'>)

I am unable to assign directly to the underlying array using the values attribute. However, if I explicitly start with floats, I can:

>>> data = [[1.0, np.nan], [2.0, -17.0]]
>>> f = pd.DataFrame(data)
>>> f
     0     1
0  1.0   NaN
1  2.0 -17.0
>>> f.values[1, 1] = -99.0
>>> f
     0     1
0  1.0   NaN
1  2.0 -99.0

Does anyone know a rule that would have allowed me to predict this? I feel like I must be missing something obvious.


Pandas makes no guarantee about whether an assignment to df.values affects df, so I would recommend never trying to modify df via df.values. Whether it happens to work is an implementation detail.

Under the hood, df stores its values in "blocks". The blocks are segregated by dtype, though sometimes more than one block can have the same dtype.

When you use

df2 = pd.DataFrame([[1, np.nan], [2, -17]])

the first column has integer dtype, while the second column has floating-point dtype (np.nan is a float):

In [27]: df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
0    2 non-null int64
1    1 non-null float64
dtypes: float64(1), int64(1)
memory usage: 112.0 bytes

When you access the df2.values property, a single NumPy array is returned. When df2 has columns of different dtypes, Pandas promotes them to a single common dtype. In the worst case, that common dtype is object. Here, the integers are promoted to floating-point dtype.

In [28]: df2.values.dtype
Out[28]: dtype('float64')
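
The promotion can be checked directly. This is only a minimal sketch; the names mixed_numeric and mixed_object and the "label" column are made up for illustration and do not appear in the question.

```python
import numpy as np
import pandas as pd

# Integer column + float column: the common dtype is float64.
mixed_numeric = pd.DataFrame([[1, np.nan], [2, -17]])
numeric_dtype = mixed_numeric.values.dtype

# Adding an object column (hypothetical "label") leaves object
# as the only dtype that can hold every value.
mixed_object = mixed_numeric.copy()
mixed_object["label"] = pd.Series(["a", "b"], dtype=object)
object_dtype = mixed_object.values.dtype

print(numeric_dtype)
print(object_dtype)
```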

The dtype promotion requires that the underlying data be copied into a new NumPy array. Thus, modifying the copy returned by df2.values does not affect the original data in df2.
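
One way to see the copy without writing to anything is np.shares_memory. A small sketch, assuming the same mixed-dtype frame as above (first, second and fresh_each_time are illustrative names):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[1, np.nan], [2, -17]])

first = df2.values
second = df2.values

# Each access builds a new float64 array from the int and float blocks,
# so the two results do not share a buffer with each other.
fresh_each_time = not np.shares_memory(first, second)
print(fresh_each_time)
```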

In contrast, if the DataFrame's data is entirely of one dtype, then f.values returns a view of the original data. So in this special case, modifying f.values affects f itself.
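
The same check can be run on the all-float frame, followed by an assignment through the indexers, which does not depend on how the data is laid out internally. Treat this as a sketch rather than a guarantee: same_buffer is an illustrative name, the view behavior is an implementation detail as noted above, and newer pandas releases with Copy-on-Write may hand back a read-only array from .values.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame([[1.0, np.nan], [2.0, -17.0]])   # single dtype
f2 = pd.DataFrame([[1, np.nan], [2, -17]])        # mixed dtypes

# With one float64 block, repeated accesses typically expose the same buffer.
same_buffer = np.shares_memory(f.values, f.values)
print(same_buffer)

# Supported way to change a cell; works for both frames.
f.iloc[1, 1] = -99.0
f2.iloc[1, 1] = -99.0
print(f.iloc[1, 1], f2.iloc[1, 1])
```

Going through .iloc (or .loc) lets pandas decide how to update its own storage, so the result is the same for f and f2.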