I have a series of variables of string type and I have to transform them in order to use sklearn estimators.
I'm using DataFrameMapper from the library sklearn_pandas.
In the following example I have a DataFrame with columns A, B, C, D and E. Suppose that 'A', 'B' and 'C' are string features: A has 25 unique strings, B has 10 and C has 30. After transforming the data with LabelBinarizer(), the resulting matrix would have 25 + 10 + 30 + 1 (from D) + 1 (from E) = 67 features. How do I know which columns correspond to the original string values of each variable?
As mentioned before the first 3 are string variables so I have to do the following transformation:
    mapper = DataFrameMapper([
        ('A', LabelBinarizer()),
        ('B', LabelBinarizer()),
        ('C', LabelBinarizer()),
        (['D', 'E'], StandardScaler()),
    ])
    X = np.array(mapper.fit_transform(df), dtype=float)
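For reference, a fitted LabelBinarizer exposes its categories in `classes_`, in the same order as its output columns, so the mapping asked about here can be recovered per column. The following is a minimal sketch on a toy list standing in for column 'C'; the values and the "C=" naming scheme are assumptions chosen for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the string column 'C' (hypothetical values)
values = ["red", "green", "blue", "green"]

lb = LabelBinarizer()
encoded = lb.fit_transform(values)

# classes_ is sorted, and output column i corresponds to classes_[i]
column_names = ["C=" + str(c) for c in lb.classes_]
rows = encoded.tolist()

print(column_names)
print(rows)
```

Note that with exactly two classes LabelBinarizer emits a single 0/1 column instead of two, so the names would not line up one-to-one in that case.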
By combining DictVectorizer() with the mapper, it is possible to keep track of the column names. This is useful if, for example, you want to visualize a decision tree with export_graphviz.
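    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    dvec = DictVectorizer(sparse=False)
    # One dict per row; string values become columns named "column=value"
    X = dvec.fit_transform(df.to_dict(orient='records'))
    # scikit-learn < 1.0: use dvec.get_feature_names() instead
    df_t = pd.DataFrame(X, columns=dvec.get_feature_names_out())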
Here df is the input DataFrame, with A, B and C being categorical features, and df_t is the transformed DataFrame, in which every encoded categorical column carries a header of the form 'A=value'.
You can then scale the numerical features D and E and convert everything into a NumPy array to use with sklearn:
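To make the naming concrete, here is a small self-contained sketch on a toy frame. The frame, its values and the `_demo` names are hypothetical. It reads the names from the `feature_names_` attribute of the fitted DictVectorizer.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy frame: 'A' is a string feature, 'D' is numeric
df_demo = pd.DataFrame({"A": ["x", "y", "x"], "D": [1.0, 2.0, 3.0]})

dvec_demo = DictVectorizer(sparse=False)
X_demo = dvec_demo.fit_transform(df_demo.to_dict(orient="records"))

# feature_names_ lists the output columns in order:
# string features expand to "column=value", numeric ones keep their name
names = list(dvec_demo.feature_names_)
df_demo_t = pd.DataFrame(X_demo, columns=names)

print(df_demo_t)
```

Each string value gets its own 0/1 column, while the numeric column passes through unchanged under its original name.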
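    numerical = ['D', 'E']
    # Keep the column order of df_t (a set difference would shuffle it)
    categorical = [c for c in df_t.columns if c not in numerical]
    mapper = DataFrameMapper([
        (categorical, None),  # pass the encoded columns through unchanged
        (numerical, StandardScaler()),
    ])
    # Column names of X, in order
    explanatory_variables_columns = categorical + numerical
    X = np.array(mapper.fit_transform(df_t), dtype=float)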
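As an illustration of the export_graphviz use case, the sketch below fits a small tree and passes the tracked names as `feature_names`. To stay short and self-contained it uses hypothetical toy data, and it skips both the scaling step and sklearn_pandas. In the pipeline above, `explanatory_variables_columns` would play the role of `feature_names`.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy data: the target depends only on the string feature 'A'
df_toy = pd.DataFrame({
    "A": ["x", "y", "x", "y", "x", "y"],
    "D": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
})
y_toy = [1, 0, 1, 0, 1, 0]

dvec_toy = DictVectorizer(sparse=False)
X_toy = dvec_toy.fit_transform(df_toy.to_dict(orient="records"))
feature_names = list(dvec_toy.feature_names_)

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# out_file=None makes export_graphviz return the DOT source as a string
dot_source = export_graphviz(tree, out_file=None, feature_names=feature_names)
print(dot_source)
```

The node labels in the resulting DOT text refer to the encoded columns by name (for example a split on one of the 'A=...' columns) rather than by positional index.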