Richard Ji - 12 days ago 9

Python Question

All four (`factorize`, `LabelEncoder`, `get_dummies` and `OneHotEncoder`) look really similar to me: in some situations some of them give the same result, in others they do not. Any help will be much appreciated!

As far as I know, and I assume this is true internally, `factorize` and `LabelEncoder` work the same way and show no big difference in terms of results. I am not sure whether they take a similar amount of time on large inputs.

`get_dummies` and `OneHotEncoder` yield the same result, but `OneHotEncoder` can only handle numbers while `get_dummies` takes all kinds of input, right? `get_dummies` also generates new column names automatically for each input column, whereas `OneHotEncoder` does not (it names the new columns 1, 2, 3, ...). So `get_dummies` is better in all respects.

Please correct me if I am wrong! Thank you!

Answer

These four encoders can be split into two categories:

- Encode **labels into categorical variables**: Pandas `factorize` and scikit-learn `LabelEncoder`. The result will have 1 dimension.
- Encode **categorical variable into dummy/indicator (binary) variables**: Pandas `get_dummies` and scikit-learn `OneHotEncoder`. The result will have n dimensions, one per distinct value of the encoded categorical variable.

The main difference between pandas and scikit-learn encoders is that scikit-learn encoders are made to be used in **scikit-learn pipelines** with `fit` and `transform` methods.
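To make the `fit`/`transform` point concrete, here is a minimal sketch, assuming a scikit-learn release that accepts string categories (0.20 or later). The `train` and `test` frames are made up for illustration. It shows that `get_dummies` derives its columns from whatever data it is given, while a fitted `OneHotEncoder` keeps the categories it learned.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'D' appears only in the test set
train = pd.DataFrame({'Col': ['A', 'B', 'C']})
test = pd.DataFrame({'Col': ['B', 'D']})

# get_dummies has no memory: each call builds columns from its own input
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(list(train_dummies.columns))  # ['Col_A', 'Col_B', 'Col_C']
print(list(test_dummies.columns))   # ['Col_B', 'Col_D']

# OneHotEncoder learns the categories in fit and reuses them in transform;
# handle_unknown='ignore' encodes unseen values as an all-zero row
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
test_encoded = enc.transform(test).toarray()
print(test_encoded)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

The encoded test set keeps the three columns learned from the training data, which is what a downstream model fitted on the training data expects.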

Pandas `factorize` and scikit-learn `LabelEncoder` belong to the first category. They can be used to create categorical variables, for example to transform strings into numbers.

```
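import pandas as pd
from sklearn import preprocessing

# Test data
df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# factorize returns (codes, uniques); keep only the integer codes
df['Fact'] = pd.factorize(df['Col'])[0]
# LabelEncoder learns the classes and maps each one to an integer
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2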
```
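The two do not always agree, though. As a small sketch with made-up, unsorted input: `factorize` numbers the values in order of first appearance, while `LabelEncoder` numbers them in sorted order. Passing `sort=True` to `factorize` makes the codes match, and `LabelEncoder` can additionally map codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical input where the first value seen is not the smallest
values = pd.Series(['b', 'a', 'b', 'c'])

# factorize: codes follow order of first appearance
codes, uniques = pd.factorize(values)
print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['b', 'a', 'c']

# LabelEncoder: codes follow the sorted order of the classes
le = LabelEncoder()
labels = le.fit_transform(values)
print(labels.tolist())       # [1, 0, 1, 2]
print(le.classes_.tolist())  # ['a', 'b', 'c']

# sort=True makes factorize agree with LabelEncoder
sorted_codes, _ = pd.factorize(values, sort=True)
print(sorted_codes.tolist())  # [1, 0, 1, 2]

# The fitted encoder can map codes back to labels
decoded = le.inverse_transform([2, 0])
print(decoded.tolist())  # ['c', 'a']
```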

Pandas `get_dummies` and scikit-learn `OneHotEncoder` belong to the second category. They can be used to create binary variables. In older scikit-learn releases (before 0.20), `OneHotEncoder` can only be used with categorical integers, while `get_dummies` can be used with other types of variables.

```
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# dtype=float gives the 1.0/0.0 output below; recent pandas may default to booleans
df = pd.get_dummies(df, dtype=float)
print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# Older scikit-learn releases need integer input, so encode the strings first
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])
enc = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a dense array
df = pd.DataFrame(enc.fit_transform(df).toarray())
print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
```
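On the column-name concern from the question: the fitted encoder exposes the categories it learned, so readable names can be rebuilt by hand. The following is only a sketch, assuming scikit-learn 0.20 or later (where `OneHotEncoder` accepts strings directly and has a `categories_` attribute). The `Col_` prefix is chosen here to mimic `get_dummies` and is not something the encoder produces itself.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# No LabelEncoder step: strings are passed to the encoder directly
enc = OneHotEncoder()
encoded = enc.fit_transform(df[['Col']]).toarray()

# categories_ holds one array of learned categories per input column
names = ['Col_' + str(c) for c in enc.categories_[0]]
result = pd.DataFrame(encoded, columns=names, index=df.index)
print(list(result.columns))  # ['Col_A', 'Col_B', 'Col_C']
print(result)
```

So the missing column names are a matter of convenience rather than a hard limitation of `OneHotEncoder`.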