Richard Ji - 2 months ago 45

Python Question

All four (factorize, LabelEncoder, get_dummies and OneHotEncoder) look really similar to me. In some situations some of them give the same result, in others they do not. Any help would be greatly appreciated!

As far as I know, factorize and LabelEncoder work the same way internally and show no big difference in their results, but I am not sure whether they take a similar amount of time on large inputs.

get_dummies and OneHotEncoder yield the same result, but OneHotEncoder can only handle numbers while get_dummies takes all kinds of input, right? get_dummies also generates new column names automatically for each input column, whereas OneHotEncoder does not (it names the new columns 1, 2, 3, ...). So get_dummies seems better in every respect.

Please correct me if I am wrong! Thank you!

Answer

These four encoders can be split into two categories:

- Encode **labels into categorical variables**: Pandas `factorize` and scikit-learn `LabelEncoder`. The result will have 1 dimension.
- Encode **categorical variable into dummy/indicator (binary) variables**: Pandas `get_dummies` and scikit-learn `OneHotEncoder`. The result will have n dimensions, one per distinct value of the encoded categorical variable.

The main difference between pandas and scikit-learn encoders is that scikit-learn encoders are made to be used in **scikit-learn pipelines** with `fit` and `transform` methods.
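As a rough illustration of the `fit`/`transform` split, the sketch below learns the categories from one set of values and reuses them on another. It assumes scikit-learn 0.20 or later (so that `OneHotEncoder` accepts strings); the `train` and `test` arrays and the `handle_unknown='ignore'` setting are choices made for this example, not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values (one column each)
train = np.array([['A'], ['B'], ['C']])
test = np.array([['B'], ['D']])  # 'D' was never seen during fit

# handle_unknown='ignore' encodes unseen categories as an all-zero row
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)                           # learn the categories once
encoded = enc.transform(test).toarray()  # reuse them on new data

print(enc.categories_[0])  # categories learned from train
print(encoded)             # one row per test value, one column per learned category
```

`pd.get_dummies` keeps no memory of the categories it has seen, so calling it on `test` alone would produce columns for B and D rather than for A, B and C.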

Pandas `factorize` and scikit-learn `LabelEncoder` belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.

```
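import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing

# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# factorize returns (codes, uniques); keep only the integer codes
df['Fact'] = pd.factorize(df['Col'])[0]
# LabelEncoder learns the classes and maps them to integers in one call
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2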
```
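The two columns match here only because the test data happens to be in sorted order. `factorize` numbers values in order of first appearance, while `LabelEncoder` sorts the classes first. A small sketch with a made-up, unsorted input (the `values` series is an assumption for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(['C', 'A', 'B', 'A'])  # deliberately unsorted example data

# Default factorize: codes follow the order in which values first appear
codes, uniques = pd.factorize(values)
# sort=True makes factorize assign codes in sorted order instead
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# LabelEncoder always sorts the classes before assigning integers
le = LabelEncoder()
labels = le.fit_transform(values)

print(codes)         # [0 1 2 1]
print(labels)        # [2 0 1 0]
print(codes_sorted)  # [2 0 1 0]
```

Passing `sort=True` to `factorize` is one way to get the same codes from both.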

Pandas `get_dummies` and scikit-learn `OneHotEncoder` belong to the second category. They can be used to create binary variables. In older scikit-learn releases (before 0.20), `OneHotEncoder` can only be used with categorical integers, while `get_dummies` can be used with other types of variables; newer releases of `OneHotEncoder` also accept string categories directly.

```
import pandas as pd
from pandas import DataFrame
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# dtype=float gives the float output below; the default dtype depends on the pandas version
df = pd.get_dummies(df, dtype=float)
print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0

df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# Older scikit-learn versions (before 0.20) need the characters converted
# to integers before OneHotEncoder can be used
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense
df = DataFrame(enc.fit_transform(df).toarray())
print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
```