Rich Thompson - 8 months ago 52

Python Question

I was messing with pandas in the interpreter and the following behavior took me by surprise:

```
>>> data2 = [[1, np.nan], [2, -17]]
>>> f2 = pd.DataFrame(data2)
>>> f2
   0     1
0  1   NaN
1  2 -17.0
>>> f2.values[1, 1] = -99.0
>>> f2
   0     1
0  1   NaN
1  2 -17.0
>>> type(f2.values[0, 0]), type(f2.values[1, 0])
(<class 'numpy.float64'>, <class 'numpy.float64'>)
```

I am unable to assign directly to the underlying array using the values attribute. However, if I explicitly start with floats, I can:

```
>>> data = [[1.0, np.nan], [2.0, -17.0]]
>>> f = pd.DataFrame(data)
>>> f
     0     1
0  1.0   NaN
1  2.0 -17.0
>>> f.values[1, 1] = -99.0
>>> f
     0     1
0  1.0   NaN
1  2.0 -99.0
```

Does anyone know a rule that would have allowed me to predict this? I feel like I must be missing something obvious.

Answer

Pandas does not guarantee when assignments to `df.values` affect `df`, so I would recommend *never* trying to modify `df` via `df.values`. How this works is an implementation detail.
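If the goal is simply to change one cell, a minimal sketch of the supported route is to assign through the DataFrame's own indexers. The frame below reuses the question's data; the value `5.0` is only illustrative.

```python
import numpy as np
import pandas as pd

f2 = pd.DataFrame([[1, np.nan], [2, -17]])

# Assign through the DataFrame's indexers rather than through .values.
f2.iloc[1, 1] = -99.0  # position-based: second row, second column
f2.loc[0, 1] = 5.0     # label-based: row label 0, column label 1

print(f2)
```

Both assignments go through pandas itself, so they do not depend on whether `.values` happens to return a view or a copy.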

Under the hood, `df` stores its values in "blocks". The blocks are segregated by dtype, though sometimes more than one block can have the same dtype.

When you use

```
df2 = pd.DataFrame([[1, np.nan], [2, -17]])
```

the first column has integer dtype while the second column has floating-point dtype.

```
In [27]: df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
0    2 non-null int64
1    1 non-null float64
dtypes: float64(1), int64(1)
memory usage: 112.0 bytes
```

When you access the `df2.values` property, a single NumPy array is returned. When `df2` has columns of non-homogeneous dtype, Pandas promotes them to a single common dtype. In the worst case, the common dtype may be `object`. In this case, the integers are promoted to floating-point dtype.

```
In [28]: df2.values.dtype
Out[28]: dtype('float64')
```
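To see the promotion on other inputs, here is a small sketch. The frames and column names (`a`, `b`) are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Integer and float columns: the common dtype is float64.
mixed_numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, np.nan]})
print(mixed_numeric.dtypes)
print(mixed_numeric.values.dtype)

# An integer column next to an object column: the common dtype is object.
with_objects = pd.DataFrame(
    {"a": [1, 2], "b": pd.Series(["x", "y"], dtype=object)}
)
print(with_objects.values.dtype)
```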

The dtype promotion requires that the underlying data be *copied* into a new NumPy array. Thus, modifying the copy returned by `df2.values` does not affect the original data in `df2`.

In contrast, if the DataFrame's data is entirely of one dtype, then `f.values` returns a view of the original data. So in this special case, modifying `f.values` affects `f` itself.
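One way to check which case you are in, rather than relying on it, is `np.shares_memory`. The sketch below assumes the two frames from the question. The result for the single-dtype frame is printed rather than asserted, since it is an implementation detail that may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[1, np.nan], [2, -17]])      # integer + float columns
f = pd.DataFrame([[1.0, np.nan], [2.0, -17.0]])  # float columns only

# Mixed dtypes: .values builds a new promoted array, so it cannot
# share memory with the data stored for column 1.
mixed_shares = np.shares_memory(df2.values, df2[1].to_numpy())
print("mixed dtypes share memory:", mixed_shares)

# Single dtype: .values may be a view of the stored data.
uniform_shares = np.shares_memory(f.values, f[1].to_numpy())
print("single dtype shares memory:", uniform_shares)
```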