JonathanBechtel - 2 months ago
Python Question

LabelEncoder().fit_transform vs. pd.get_dummies for categorical coding

It was recently brought to my attention that if you have a dataframe df like this:

   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43


You can encode the categorical data automatically with pd.get_dummies:

df1 = pd.get_dummies(df)


Which yields this:

   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
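
For anyone who wants to reproduce the output above, here is a minimal sketch. The dataframe is rebuilt by hand from the table in the question. The dtype of the dummy columns depends on the pandas version (float, uint8 or bool), so yours may not print as 1.0/0.0. The dummy_na=True call is an addition, not part of the original example: it shows one way to keep the NaN row from turning into all zeros.

```python
import pandas as pd

# Rebuild the example dataframe from the question by hand.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", float("nan"), "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Only the string column B is expanded; A and C pass through unchanged.
df1 = pd.get_dummies(df)
print(df1)

# The NaN in row 1 becomes all zeros across the B_* columns.
# Asking for an explicit missing-value indicator keeps that information.
df_na = pd.get_dummies(df, dummy_na=True)
b_cols = [c for c in df_na.columns if c.startswith("B_")]
print(df_na[b_cols])
```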


I typically use LabelEncoder().fit_transform for this sort of task before putting it in pd.get_dummies, but if I can skip a few steps that'd be desirable.

Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?

Answer

Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings, but this is being worked on at the moment.
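
As a rough illustration of the integer-column case, the sketch below treats column A from the question as if it were categorical (an assumption made only for the example). It shows that pd.get_dummies leaves A alone by default, and then two ways to encode it: scikit-learn's OneHotEncoder, as suggested above, and the columns argument of pd.get_dummies, which is an alternative not mentioned in the answer.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same example dataframe as in the question, rebuilt by hand.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", float("nan"), "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# Default behaviour: integer columns are left as they are.
default = pd.get_dummies(df)
print(list(default.columns))

# Option 1: OneHotEncoder on the integer column.
# fit_transform returns a sparse matrix by default, hence .toarray().
encoder = OneHotEncoder()
a_encoded = encoder.fit_transform(df[["A"]]).toarray()
print(a_encoded)

# Option 2: name the columns explicitly so get_dummies encodes A as well.
forced = pd.get_dummies(df, columns=["A", "B"])
print(list(forced.columns))
```

Option 2 stays entirely in pandas. OneHotEncoder is the better fit when the same encoding has to be applied later to new data, because the fitted encoder remembers the categories it saw.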