Pablo Fleurquin - 4 months ago
Python Question

Is there a way to track which DataFrame Column corresponds to which Array Column(s) after LabelBinarizer Transform in sklearn?

I have a series of variables of string type and I have to transform them in order to use sklearn estimators.

I'm using DataFrameMapper from the library sklearn_pandas.

In the following example I have a dataframe with columns A, B, C, D and E. Suppose that 'A', 'B' and 'C' are string features: A has 25 unique strings, B has 10 and C has 30. After transforming the data with LabelBinarizer(), the resulting matrix would have 25 + 10 + 30 + 1 (from D) + 1 (from E) = 67 features. How do I know which columns of that matrix correspond to which string values of each original variable?

As mentioned before, the first 3 are string variables, so I have to do the following transformation:

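    import numpy as np
    from sklearn.preprocessing import LabelBinarizer, StandardScaler
    from sklearn_pandas import DataFrameMapper

    mapper = DataFrameMapper([
        ('A', LabelBinarizer()),
        ('B', LabelBinarizer()),
        ('C', LabelBinarizer()),
        (['D', 'E'], StandardScaler()),
    ])

    X = np.array(mapper.fit_transform(df), dtype=float)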


where X is a matrix of shape (num_samples, 67).

Answer

Combining DictVectorizer() with the mapper makes it possible to keep track of the column variable names. This is useful, for example, if you want to visualize a DecisionTree with export_graphviz.

The answer is based on: http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/feature_encoding.ipynb

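    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    # One dict per row; string values become one-hot columns named "column=value"
    dvec = DictVectorizer(sparse=False)
    X = dvec.fit_transform(df.to_dict(orient='records'))
    # On scikit-learn >= 1.0, use dvec.get_feature_names_out() instead
    df_t = pd.DataFrame(X, columns=dvec.get_feature_names())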

df is the input DataFrame, with A, B and C being categorical features. df_t is the transformed DataFrame, in which each encoded column carries a header that identifies the original variable and the string value it stands for.
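To make the resulting headers concrete, here is a minimal sketch on a made-up two-column frame. The names df_small, dvec_small and names are hypothetical and used only for illustration, and the hasattr check is there because the method that returns feature names differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one string column (A) and one numeric column (D)
df_small = pd.DataFrame({"A": ["red", "blue", "red"], "D": [1.0, 2.0, 3.0]})

dvec_small = DictVectorizer(sparse=False)
X_small = dvec_small.fit_transform(df_small.to_dict(orient="records"))

# get_feature_names_out() exists on newer scikit-learn releases,
# get_feature_names() on older ones
if hasattr(dvec_small, "get_feature_names_out"):
    names = list(dvec_small.get_feature_names_out())
else:
    names = dvec_small.get_feature_names()

df_small_t = pd.DataFrame(X_small, columns=names)
print(df_small_t)
```

Each string value of A gets its own column named "A=value", while the numeric column D keeps its name and its values.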

You can then scale the remaining numerical features D and E, and convert everything into a numpy array to use in sklearn.

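    numerical = ['D', 'E']
    # Keep the encoded columns in their df_t order (a set would shuffle them)
    categorical = [c for c in df_t.columns if c not in numerical]

    mapper = DataFrameMapper([
        (categorical, None),            # already encoded: pass through unchanged
        (numerical, StandardScaler()),
    ])

    # Column names of X, in the same order as the mapper's output
    explanatory_variables_columns = categorical + numerical
    X = np.array(mapper.fit_transform(df_t), dtype=float)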
Note: although no transformation is needed for the already-encoded columns derived from 'A', 'B' and 'C', you still have to include them in the mapper and use None to express "do nothing".
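To answer the original question directly (which array columns belong to which DataFrame column), one option is to group the column positions by the part of each header before the "=" that DictVectorizer inserts. The helper below is a hypothetical sketch, not part of sklearn or sklearn_pandas, and it assumes that none of the original column names contains "=" themselves.

```python
from collections import defaultdict

def group_encoded_columns(feature_names, separator="="):
    """Map each original column name to the positions of its encoded columns."""
    groups = defaultdict(list)
    for position, name in enumerate(feature_names):
        # "A=red" -> "A"; a plain numeric column such as "D" stays "D"
        original = name.split(separator, 1)[0]
        groups[original].append(position)
    return dict(groups)

# Hypothetical header list, in the same order as the columns of X
example_names = ["A=blue", "A=red", "B=no", "B=yes", "D", "E"]
column_groups = group_encoded_columns(example_names)
print(column_groups)
```

Passing explanatory_variables_columns from the snippet above instead of example_names should give the positions within X, since that list follows the mapper's output order.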