tchakravarty - 28 days ago
Python Question

pandas: pandas.DataFrame.describe returns information on only one column

For a certain Kaggle dataset (the rules prohibit me from sharing the data here, but it is readily accessible here), when I run:

import pandas
df_train = pandas.read_csv(
"01 - Data/act_train.csv.zip"
)
df_train.describe()


I get:

>>> df_train.describe()
            outcome
count  2.197291e+06
mean   4.439544e-01
std    4.968491e-01
min    0.000000e+00
25%    0.000000e+00
50%    0.000000e+00
75%    1.000000e+00
max    1.000000e+00


whereas, for the same dataset, df_train.columns gives me:

>>> df_train.columns
Index(['people_id', 'activity_id', 'date', 'activity_category', 'char_1',
'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8',
'char_9', 'char_10', 'outcome'],
dtype='object')


and df_train.dtypes gives me:

>>> df_train.dtypes
people_id            object
activity_id          object
date                 object
activity_category    object
char_1               object
char_2               object
char_3               object
char_4               object
char_5               object
char_6               object
char_7               object
char_8               object
char_9               object
char_10              object
outcome               int64
dtype: object


Am I missing some reason why pandas only describes one column in the dataset?

Answer

By default, describe only summarizes columns with a numeric dtype. To describe every column, pass the keyword argument include='all'. From the documentation:

If include is the string ‘all’, the output column-set will match the input one.
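As a minimal sketch of the difference, the snippet below builds a small made-up frame (the values are hypothetical stand-ins, not the Kaggle data) with two string columns and one integer column, then compares the default call with include='all':

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data: two string columns, one int64 column.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "char_1": ["type 1", "type 2", "type 1"],
    "outcome": [0, 1, 1],
})

# Default call: only the numeric column is summarized.
default_summary = df.describe()

# include="all": the output column set matches the input column set.
full_summary = df.describe(include="all")

print(list(default_summary.columns))
print(list(full_summary.columns))
```

In the include='all' result, statistics that do not apply to a column (for example, mean for a string column) show up as NaN.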

To clarify, the default arguments to describe are include=None, exclude=None. The behavior that results is:

None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
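The default behavior quoted above can be checked on a toy example. This is only a sketch: the frames are hypothetical, and exclude=[np.number] is one way, among others, to ask for the non-numeric columns explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, not the Kaggle data.
mixed = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 1"],
    "outcome": [0, 1, 1],
})
strings_only = mixed[["char_1"]]

# Defaults on a mixed frame: only the numeric column is kept.
mixed_default_cols = list(mixed.describe().columns)

# Defaults on a frame with no numeric columns: the non-numeric columns are described instead.
strings_default_cols = list(strings_only.describe().columns)

# Explicit selection: exclude numeric dtypes to summarize only the non-numeric columns.
non_numeric_cols = list(mixed.describe(exclude=[np.number]).columns)

print(mixed_default_cols, strings_default_cols, non_numeric_cols)
```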

Also, from the Notes section:

The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.

For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
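To see the two index layouts side by side, here is a small sketch on made-up data (the column names are borrowed from the question, the values are hypothetical). It describes a numeric column and a string column separately and prints the resulting index labels:

```python
import pandas as pd

# Hypothetical data: one string column and one integer column.
df = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 1"],
    "outcome": [0, 1, 1],
})

# Index labels produced for a numeric column.
numeric_index = list(df[["outcome"]].describe().index)

# Index labels produced for a string column.
object_index = list(df[["char_1"]].describe().index)

print(numeric_index)  # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(object_index)   # ['count', 'unique', 'top', 'freq']
```

Here top and freq are the labels pandas uses for the most common value and how often it occurs.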
