Pablo Fleurquin - 1 year ago 88

Python Question

I have a series of variables of string type and I have to transform them in order to use sklearn estimators.

I'm using DataFrameMapper from the library sklearn_pandas.

In the following example I have a dataframe with columns A, B, C, D, E. Suppose that 'A', 'B' and 'C' are string features: A has 25 unique strings, B has 10 unique strings and C has 30 unique strings. After transforming the data with LabelBinarizer(), the resulting matrix would have 25 + 10 + 30 + 1 (from D) + 1 (from E) = **67 features**. *How do I know which columns correspond to the string values of each original variable?*

As mentioned before, the first 3 are string variables, so I have to apply the following transformation:

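```
import numpy as np
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('A', LabelBinarizer()),
    ('B', LabelBinarizer()),
    ('C', LabelBinarizer()),
    (['D', 'E'], StandardScaler())])
X = np.array(mapper.fit_transform(df), dtype=float)
```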

Here X is a matrix of size (num_samples) × 67.

Answer Source

By combining DictVectorizer() with the mapper, it is possible to keep track of the column variable names. This is useful if you want to visualize a DecisionTree with export_graphviz.

The answer is based on: http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/feature_encoding.ipynb

```
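import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# One dict per row; string values are one-hot encoded as "column=value"
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.to_dict(orient='records'))
# scikit-learn < 1.0 used get_feature_names() here (removed in 1.2)
df_t = pd.DataFrame(X, columns=dvec.get_feature_names_out())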
```

df is the input DataFrame, with A, B and C being categorical features. df_t is the transformed DataFrame, in which every encoded categorical column carries a header of the form `column=value` that names the original column and the string value.
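To make the header convention concrete, here is a minimal sketch on a made-up three-row frame. The names `df_demo`, `dvec_demo` and `origin` are purely illustrative and not part of the original answer. It assumes the default `=` separator of DictVectorizer and that no original column name contains `=`. It reads the names from the fitted `feature_names_` attribute so that it does not depend on which `get_feature_names*` method the installed scikit-learn version provides.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one string column (A) and one numeric column (D)
df_demo = pd.DataFrame({
    "A": ["red", "blue", "red"],
    "D": [1.0, 2.0, 3.0],
})

dvec_demo = DictVectorizer(sparse=False)
X_demo = dvec_demo.fit_transform(df_demo.to_dict(orient="records"))

# Feature names are sorted; string features are named "column=value"
names = list(dvec_demo.feature_names_)

# Map every output column back to the original variable it came from
origin = {name: name.split("=", 1)[0] for name in names}
print(names)
print(origin)
```

With this mapping, each of the output columns can be traced back to A, B, C, D or E by looking at the part of its name before the `=`.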

You can then scale the numerical features D and E and convert everything into a NumPy array for use with sklearn.

```
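import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

numerical = ['D', 'E']
# Every other column of df_t is a one-hot column; keep the original order
categorical = [c for c in df_t.columns if c not in numerical]
mapper = DataFrameMapper([
    (categorical, None),             # pass through unchanged
    (numerical, StandardScaler())])  # standardize D and E
# Column names of X, in the same order as the mapper output
explanatory_variables_columns = categorical + numerical
X = np.array(mapper.fit_transform(df_t), dtype=float)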
```

- Although no transformation is applied to the encoded 'A', 'B' and 'C' columns, you still have to include them in the mapper and use None to express "do nothing".
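As a rough illustration of the export_graphviz use case mentioned at the top of this answer, the sketch below passes a list of tracked column names to the tree export. The toy arrays `X_toy` and `y_toy` and the `feature_names` list are invented for the example. In practice you would presumably pass `X` and `explanatory_variables_columns` from the code above, together with your own target vector.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical column names, in the same order as the columns of X_toy
feature_names = ["A=blue", "A=red", "D"]

# Made-up data: two one-hot columns followed by one numeric column
X_toy = np.array([
    [0, 1, 0.5],
    [1, 0, 1.5],
    [0, 1, 2.5],
    [1, 0, 3.5],
])
y_toy = np.array([0, 1, 0, 1])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# out_file=None returns the DOT source as a string instead of writing a file
dot_source = export_graphviz(tree, out_file=None, feature_names=feature_names)
print(dot_source)
```

The split nodes in the resulting DOT text are labelled with the readable names (for example a test on one of the `A=...` columns) rather than with anonymous column indices.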