Pablo Fleurquin - 10 months ago

Python Question

I have a series of variables of string type and I have to transform them in order to use sklearn estimators.

I'm using DataFrameMapper from the library sklearn_pandas.

In the following example I have a DataFrame with columns A, B, C, D and E. Suppose that 'A', 'B' and 'C' are string features: A has 25 unique strings, B has 10 and C has 30. After transforming the data with LabelBinarizer(), the resulting matrix has 25 + 10 + 30 + 1 (from D) + 1 (from E) = **67 features**. *How do I know which columns correspond to the string values of each original variable?*

As mentioned before, the first three are string variables, so I apply the following transformation:

```
mapper = DataFrameMapper([
    ('A', LabelBinarizer()),
    ('B', LabelBinarizer()),
    ('C', LabelBinarizer()),
    (['D', 'E'], StandardScaler())])

X = np.array(mapper.fit_transform(df), dtype=float)
```

where X is a matrix of shape (num_samples, 67).

Answer

Combining DictVectorizer() with the mapper makes it possible to keep track of the column names. This is useful, for example, if you want to visualize a decision tree with export_graphviz.

The answer is based on: http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/feature_encoding.ipynb

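```
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# One-hot encode the string columns; numeric columns pass through unchanged
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.to_dict(orient='records'))
# On scikit-learn < 1.0 use dvec.get_feature_names() instead
df_t = pd.DataFrame(X, columns=dvec.get_feature_names_out())
```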

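df is the input DataFrame, with A, B and C being categorical features. df_t is the transformed DataFrame: each categorical feature is expanded into one column per unique value, and each column header has the form `A=value`, so it identifies both the original variable and the string it encodes.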

You can then scale the numerical features D and E and convert everything into a NumPy array for use in sklearn.

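```
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

numerical = ['D', 'E']
# Everything that is not numerical is an encoded categorical column
categorical = [c for c in df_t.columns if c not in numerical]
mapper = DataFrameMapper([
    (categorical, None),                # pass through unchanged
    (numerical, StandardScaler())])     # standardize D and E
# Column names of X, in the same order as the mapper output
explanatory_variables_columns = categorical + numerical
X = np.array(mapper.fit_transform(df_t), dtype=float)
```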

- Although no further transformation is needed for the encoded 'A', 'B' and 'C' columns, you still have to include them in the mapper, with None meaning "do nothing"; otherwise the mapper drops them from the output.

Source (Stackoverflow)