user2808117 user2808117 - 4 years ago 208
Python Question

Pandas 'describe' is not returning summary of all columns

I am running 'describe()' on a dataframe and getting summaries of only int columns (pandas 14.0).

The documentation says that for object columns frequency of most common value, and additional statistics would be returned. What could be wrong? (no error message is returned by the way)

Edit:

I think it's how the function is set to behave on mixed column types in a dataframe. Although the documentation fails to mention it.

Example code:

df_test = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
df_test.dtypes
df_test.describe()
df_test['$a'] = df_test['$a'].astype(str)
df_test.describe()
df_test['$a'].describe()
df_test['$b'].describe()


My ugly work around in the meanwhile:

def my_df_describe(df):
objects = []
numerics = []
for c in df:
if (df[c].dtype == object):
objects.append(c)
else:
numerics.append(c)

return df[numerics].describe(), df[objects].describe()

Answer Source

As of pandas v15.0, use the parameter, DataFrame.describe(include = all) to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to only provide a summary for the numerical columns.

Example:

In[1]:

df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include = 'all')

Out[1]:

        $a    $b
count   5   5.000000
unique  4   NaN
top     a   NaN
freq    2   NaN
mean    NaN 2.000000
std     NaN 1.581139
min     NaN 0.000000
25%     NaN 1.000000
50%     NaN 2.000000
75%     NaN 3.000000
max     NaN 4.000000

The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.

Summarizing only numerical or object columns

  1. To call describe() on just the numerical columns use describe(include = [np.number])
  2. To call describe() on just the objects (strings) using describe(include = ['O']).

    In[2]:
    
    df.describe(include = [np.number])
    
    Out[3]:
    
             $b
    count   5.000000
    mean    2.000000
    std     1.581139
    min     0.000000
    25%     1.000000
    50%     2.000000
    75%     3.000000
    max     4.000000
    
    In[3]:
    
    df.describe(include = ['O'])
    
    Out[3]:
    
        $a
    count   5
    unique  4
    top     a
    freq    2
    
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download