tchakravarty - 1 year ago
Python Question

pandas: pandas.DataFrame.describe returns information on only one column

For a certain Kaggle dataset (the rules prohibit me from sharing the data here, but it is readily accessible here),

import pandas
df_train = pandas.read_csv(
"01 - Data/"

I get:

>>> df_train.describe()
count 2.197291e+06
mean 4.439544e-01
std 4.968491e-01
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 1.000000e+00
max 1.000000e+00

whereas, for the same dataset, df_train.columns gives me:

>>> df_train.columns
Index(['people_id', 'activity_id', 'date', 'activity_category', 'char_1',
'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8',
'char_9', 'char_10', 'outcome'],
dtype='object')

and df_train.dtypes gives me:

>>> df_train.dtypes
people_id object
activity_id object
date object
activity_category object
char_1 object
char_2 object
char_3 object
char_4 object
char_5 object
char_6 object
char_7 object
char_8 object
char_9 object
char_10 object
outcome int64
dtype: object

Am I missing some reason why pandas only describes one column in the dataset?

Answer

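By default, describe only summarizes numeric-dtype columns, which here means outcome alone, since every other column is of object dtype. Pass the keyword argument include='all' to summarize every column. From the documentation: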

If include is the string ‘all’, the output column-set will match the input one.
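
A minimal sketch of the difference, using a small made-up frame in place of the Kaggle data (the column names mirror the question, but the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the Kaggle data; the values are invented.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

# Default call: only the numeric column (outcome) is summarized.
numeric_summary = df.describe()
print(numeric_summary)

# include='all': every input column appears in the output.
# Statistics that do not apply to a column show up as NaN.
full_summary = df.describe(include="all")
print(full_summary)
```

With include='all' the numeric rows (mean, std, and so on) are NaN for the string columns, and the string rows (unique, top, freq) are NaN for outcome.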

To clarify, the default arguments to describe are include=None, exclude=None. The behavior that results is:

None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
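
The fallback described in that sentence can be seen on a frame that has no numeric columns at all. This is only a sketch with a hypothetical frame (df_obj and its values are invented):

```python
import pandas as pd

# Hypothetical frame containing only string columns.
df_obj = pd.DataFrame({
    "activity_category": ["type 1", "type 2", "type 2"],
    "char_1": ["a", "b", "b"],
})

# include=None and exclude=None (the defaults): with no numeric
# columns available, describe summarizes the non-numeric ones instead.
fallback = df_obj.describe()
print(fallback)
```

In the question's data the outcome column is numeric, so this fallback never triggers and the object columns are silently skipped.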

Also, from the Notes section:

The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.

For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
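
To compare the two index layouts side by side, one option is to describe a numeric and a string column separately. Again a sketch on invented data, not the Kaggle file:

```python
import pandas as pd

# Invented values; the column names are borrowed from the question.
df = pd.DataFrame({
    "char_1": ["type 1", "type 2", "type 2", "type 2"],
    "outcome": [0, 1, 1, 0],
})

# Series.describe picks its statistics from the column's dtype.
numeric_rows = list(df["outcome"].describe().index)
object_rows = list(df["char_1"].describe().index)

print(numeric_rows)  # count, mean, std, min, the three percentiles, max
print(object_rows)   # count, unique, top (most common), freq
```

Here "lower" and "upper" are the default 25% and 75% percentiles, and "top" and "freq" are the most common value and its frequency.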
