All four look very similar to me: in some situations some of them give the same result, in others they do not. Any help would be greatly appreciated!
As far as I know, and I assume, factorize and LabelEncoder work the same way internally and show no big difference in their results. I am not sure whether they take a similar amount of time on large inputs.
get_dummies and OneHotEncoder yield the same result, but OneHotEncoder can only handle numbers while get_dummies takes all kinds of input, right? get_dummies generates new column names automatically for each input column, but OneHotEncoder does not (it names the new columns 1, 2, 3, ...). So get_dummies is better in all respects.
Please correct me if I am wrong! Thank you!
These four encoders can be split into two categories:
factorize and LabelEncoder. The result will have 1 dimension.
get_dummies and OneHotEncoder. The result will have n dimensions, one per distinct value of the encoded categorical variable.
The main difference between pandas and scikit-learn encoders is that scikit-learn encoders are made to be used in scikit-learn pipelines with fit and transform methods.
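To illustrate why the fit/transform split matters, here is a minimal sketch, assuming scikit-learn 0.20 or later (so that OneHotEncoder accepts strings directly). The train and test frames are made up for illustration. get_dummies only sees the frame it is given, whereas a fitted OneHotEncoder reuses the categories it learned:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: the test set lacks 'C' and contains an unseen 'D'
train = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
test = pd.DataFrame({'Col': ['B', 'A', 'D']})

# get_dummies derives the columns from each frame separately, so they differ
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(list(train_dummies.columns))  # ['Col_A', 'Col_B', 'Col_C']
print(list(test_dummies.columns))   # ['Col_A', 'Col_B', 'Col_D']

# OneHotEncoder learns the categories in fit and reuses them in transform;
# handle_unknown='ignore' encodes the unseen 'D' as an all-zero row
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
test_encoded = enc.transform(test).toarray()
print(test_encoded.shape)  # (3, 3)
```

The encoded test set keeps the three columns learned from the training data, which is usually what a downstream model expects.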
pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables, for example to transform characters into numbers.
    import pandas as pd
    from sklearn import preprocessing

    # Test data
    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

    # factorize returns a (codes, uniques) tuple; keep only the integer codes
    df['Fact'] = pd.factorize(df['Col'])[0]

    le = preprocessing.LabelEncoder()
    df['Lab'] = le.fit_transform(df['Col'])

    print(df)
    #   Col  Fact  Lab
    # 0   A     0    0
    # 1   B     1    1
    # 2   B     1    1
    # 3   C     2    2
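The two give identical codes above only because the test data happens to be sorted already. A small sketch with unsorted input (the values are made up) shows where they diverge: factorize numbers values in order of first appearance, while LabelEncoder sorts the classes first.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(['B', 'A', 'C', 'B'])

# factorize numbers the values in order of first appearance
codes, uniques = pd.factorize(values)
print(codes.tolist())   # [0, 1, 2, 0]
print(list(uniques))    # ['B', 'A', 'C']

# sort=True makes factorize number the sorted unique values instead
sorted_codes, _ = pd.factorize(values, sort=True)
print(sorted_codes.tolist())  # [1, 0, 2, 1]

# LabelEncoder always sorts the classes, so 'A' becomes 0
le = LabelEncoder()
labels = le.fit_transform(values)
print(labels.tolist())       # [1, 0, 2, 1]
print(le.classes_.tolist())  # ['A', 'B', 'C']

# The fitted encoder can map codes back to the original values
print(le.inverse_transform([0, 2]).tolist())  # ['A', 'C']
```

So the results match only when the input is already sorted or when factorize is called with sort=True.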
pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They can be used to create binary variables.
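In scikit-learn versions before 0.20, OneHotEncoder can only be used with categorical integers, while get_dummies can be used with other types of variables. From 0.20 onward, OneHotEncoder also accepts string categories.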
    import pandas as pd

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    # dtype=float reproduces the 1.0/0.0 output; the default dtype depends on the pandas version
    df = pd.get_dummies(df, dtype=float)

    print(df)
    #    Col_A  Col_B  Col_C
    # 0    1.0    0.0    0.0
    # 1    0.0    1.0    0.0
    # 2    0.0    1.0    0.0
    # 3    0.0    0.0    1.0

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
    # Older scikit-learn versions need integer input, so we first transform
    # the characters into integers in order to use the OneHotEncoder
    le = LabelEncoder()
    df['Col'] = le.fit_transform(df['Col'])
    enc = OneHotEncoder()
    # fit_transform returns a sparse matrix; convert it to a dense array
    df = pd.DataFrame(enc.fit_transform(df).toarray())

    print(df)
    #      0    1    2
    # 0  1.0  0.0  0.0
    # 1  0.0  1.0  0.0
    # 2  0.0  1.0  0.0
    # 3  0.0  0.0  1.0