Lin Ma - 3 months ago
Python Question

Transform a string column of a pandas data frame into 0/1 vectors

LabelEncoder and OneHotEncoder work well for numpy arrays: together they transform strings into 0/1 vectors.

My question is: is there a neat API to convert a column of a pandas data frame into 0/1 vectors? My code and the raw content of the data frame's source file, 123.csv, are shown below. Suppose I want 0/1 encodings for columns c_a, c_b and c_c. The three columns are independent, so each one should be encoded separately.

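Code:

import pandas as pd
# Note: header=None makes pandas treat the first row (c_a,c_b,c_c,c_d) as data;
# omit it if that row should become the column names.
sample = pd.read_csv('123.csv', sep=',', header=None)
print(sample.dtypes)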


Contents of 123.csv:

c_a,c_b,c_c,c_d
hello,python,pandas,1.2
hi,c++,vector,1.2


LabelEncoder and OneHotEncoder example for numpy:

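import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

S = np.array(['b','a','c'])
# Map each string to an integer label (classes are sorted: a=0, b=1, c=2)
le = LabelEncoder()
S = le.fit_transform(S)
print(S)
# One-hot encode the integer labels; the result is sparse, so densify it
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
print(one_hot)

which results in: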

[1 0 2]

[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]


Edit 1: I tried get_dummies, and the results appear to be 0.0 and 1.0 (float). Is there a way to get integers directly?

0_c_a 0_hello 0_hi 0_ho 1_c++ 1_c_b 1_java 1_python 2_c_c 2_numpy \
0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0

Answer

Are you looking for get_dummies?

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", "c"])
pd.get_dummies(s)

If you want ints:

pd.get_dummies(s).astype(np.uint8)

reference:

Pandas get_dummies to output dtype integer/bool instead of float
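
As a rough sketch of how this might apply to the data frame in the question: get_dummies also accepts a whole DataFrame, and its columns argument restricts the encoding to the named columns, each encoded independently. The example below assumes the first row of 123.csv is a header (so header=None is dropped), inlines the file content through io.StringIO to stay self-contained, and assumes a pandas version recent enough to support the dtype argument (0.23 or later); the variable names are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the 123.csv content from the question (assumption: row 1 is a header)
csv_text = """c_a,c_b,c_c,c_d
hello,python,pandas,1.2
hi,c++,vector,1.2
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Encode only the three string columns; c_d is passed through unchanged.
# dtype=np.uint8 requests integer 0/1 values instead of float or bool.
encoded = pd.get_dummies(sample, columns=["c_a", "c_b", "c_c"], dtype=np.uint8)
print(encoded)
```

Each encoded column gets its own prefix (for example c_a_hello and c_a_hi), so values that happen to repeat across columns stay distinct.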