user4979733 - 7 days ago 5
Python Question

Pandas: Why is the default column type for numeric values float?

I am using Pandas 0.18.1 with Python 2.7.x. I first read an empty DataFrame from a CSV file. The types of its columns are all object, which is OK. When I assign one row of data, the type of the numeric columns changes to float64. I was expecting int or int64. Why does this happen?

Is there a way to set some global option to let Pandas know that numeric values should be treated as int by default unless the data contains a decimal point (.)? For example, given [0, 1.0, 2.], the first column would be int but the other two would be float64.

For example:

>>> import pandas as pd
>>> df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
>>> print df.dtypes
bbox_id_seqno    object
type             object
layer            object
ll_x             object
ll_y             object
ur_x             object
ur_y             object
polygon_count    object
dtype: object
>>> df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]
>>> print df.dtypes
bbox_id_seqno     object
type              object
layer             object
ll_x             float64
ll_y             float64
ur_x             float64
ur_y             float64
polygon_count    float64
dtype: object

Answer

It's not possible for Pandas to store NaN values in integer columns.

This makes float the obvious default choice for data storage: if the default were int, Pandas would have to change the data type of the entire column as soon as a missing value appeared. And missing values arise very often in practice.
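The upcasting described above can be seen on a small Series. This is only a sketch: the values and the fill value 0 are made up for illustration, and it assumes that replacing the missing entry with 0 is acceptable for your data before casting back.

```python
import pandas as pd

# An integer column with no missing values keeps its integer dtype.
s = pd.Series([1, 2, 3], dtype="int64")
print(s.dtype)

# Reindexing to a label that does not exist (3) introduces a NaN,
# so the whole column is upcast to float64.
s_with_gap = s.reindex([0, 1, 2, 3])
print(s_with_gap.dtype)

# Once the gap is filled (0 is an arbitrary, illustrative choice),
# the column can be cast back to an integer dtype explicitly.
s_filled = s_with_gap.fillna(0).astype("int64")
print(s_filled.dtype)
```

The same explicit astype call can be applied to individual DataFrame columns after a row has been assigned, provided none of the values in those columns are missing.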

As for why this is, it's a restriction inherited from NumPy. Basically, Pandas needs to set aside a particular bit pattern to represent NaN. This is straightforward for floating-point numbers, because the IEEE 754 standard already reserves bit patterns for NaN. It's more awkward and less efficient to do this for a fixed-width integer, where every bit pattern already represents a valid number.
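To make the bit-pattern argument concrete, here is a small sketch using only the standard library. The helper name float_bits is hypothetical and introduced just for this illustration; it is not part of Pandas or NumPy.

```python
import struct

def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of a float as a string of 0s and 1s."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")

# Split the NaN pattern into its sign, exponent and mantissa fields.
bits = float_bits(float("nan"))
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]

# IEEE 754 marks NaN with an all-ones exponent and a non-zero mantissa.
print(exponent)
print(int(mantissa, 2) != 0)

# Reinterpreting the same 8 bytes as a signed 64-bit integer just gives
# an ordinary integer: no int64 pattern is left over to mean "missing".
(nan_as_int,) = struct.unpack(">q", struct.pack(">d", float("nan")))
print(nan_as_int)
```

Reserving one integer value as a "missing" marker would be possible in principle, but every arithmetic operation would then have to check for it, which is the inefficiency mentioned above.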
