
Retaining categorical dtype upon dataframe concatenation

I have two dataframes with identical column names and dtypes, similar to the following:

A object
B category
C category


The categories are not identical across the two dataframes.

When concatenating normally, pandas outputs:

A object
B object
C object


This is the expected behaviour as per the documentation.
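
For context, here is a minimal sketch of that default behaviour (the column name B comes from the listing above, the values are made up): when two categorical columns with different categories are concatenated, the result falls back to object.

import pandas as pd

a = pd.DataFrame({'B': pd.Categorical(['x', 'y'])})
b = pd.DataFrame({'B': pd.Categorical(['y', 'z'])})

a.dtypes
B    category
dtype: object

pd.concat([a, b]).dtypes
B    object
dtype: object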

However, I wish to keep the categorical dtype and union the categories, so I have tried union_categoricals on the columns that are categorical in both dataframes. cdf and df are my two dataframes:

from pandas.api.types import union_categoricals

for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        print(column)
        union_categoricals([cdf[column], df[column]], ignore_order=True)

cdf = pd.concat([cdf, df])


This is still not providing me with a categorical output.

Answer

I don't think this is completely obvious from the documentation, but you could do something like the following. Here's some sample data:

import pandas as pd

df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})

Use union_categoricals to get consistent categories across the dataframes. Try df.x.cat.codes if you need to convince yourself that this works.

from pandas.api.types import union_categoricals

uc = union_categoricals([df1.x, df2.x])
df1.x = pd.Categorical(df1.x, categories=uc.categories)
df2.x = pd.Categorical(df2.x, categories=uc.categories)
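
For example, with the sample frames above the combined categories should come out as ['cat', 'dog', 'rat'] (union_categoricals keeps categories in order of appearance), so the same label maps to the same code in both dataframes. The expected output, under that assumption:

df1.x.cat.codes
0    1
1    0
dtype: int8

df2.x.cat.codes
0    0
1    2
dtype: int8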

Concatenate and verify the dtype is categorical.

df3 = pd.concat([df1,df2])

df3.x.dtypes
category
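
To fold this back into the loop from the question, here is a sketch reusing the cdf/df names (not part of the original code): the key change is assigning the re-categorised columns back rather than discarding the union_categoricals result.

from pandas.api.types import union_categoricals

for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        uc = union_categoricals([cdf[column], df[column]])
        cdf[column] = pd.Categorical(cdf[column], categories=uc.categories)
        df[column] = pd.Categorical(df[column], categories=uc.categories)

cdf = pd.concat([cdf, df])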

As @C8H10N4O2 suggests, you could also just coerce the columns from object back to categorical after concatenating. Honestly, for smaller datasets I think that's the best approach because it's simpler. But for larger dataframes, using union_categoricals should be much more memory efficient.
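
A minimal sketch of that coercion approach, starting from the original (un-aligned) df1 and df2: the concat falls back to object, and astype restores the categorical dtype afterwards.

df3 = pd.concat([df1, df2])
df3.x = df3.x.astype('category')

df3.dtypes
x    category
dtype: object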
