MYjx - 1 year ago
Python Question

How to apply preprocessing methods to several columns at once in sklearn

I have many columns in my pandas DataFrame, and I am trying to apply sklearn preprocessing to them using DataFrameMapper from the sklearn-pandas library, for example:

mapper= DataFrameMapper([

Is there a more succinct way to preprocess many variables at once without writing each one out explicitly?

Another annoyance: when I convert the pandas DataFrame into the arrays that sklearn works with, the column names are lost, which makes feature selection very difficult. Does anyone know how to preserve the column names as keys when converting pandas DataFrames to NumPy arrays?

Thank you so much!

Answer Source
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    # [scalar] selects a 2-D, single-column array, which StandardScaler expects
    [([scalar], StandardScaler()) for scalar in scalars]
)

If you're doing this a lot, you could even write your own function:

mapper = data_frame_mapper(binarizers=['gender'],
    encoders=['gradelevel', 'subject', 'districtid'],
    scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
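
On the second part of the question (losing column names), one common approach is to wrap the transformed array back into a DataFrame, reusing the original column list and index. The sketch below uses plain pandas and scikit-learn; the toy data is invented for illustration, and only the column names are borrowed from the lists above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real frame (values are made up).
df = pd.DataFrame({
    "pRate": [0.1, 0.5, 0.9],
    "sbmRate": [1.0, 2.0, 3.0],
    "gender": ["f", "m", "f"],
})
scalars = ["pRate", "sbmRate"]

# fit_transform returns a bare NumPy array, so the labels are re-attached by hand.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[scalars]),
    columns=scalars,
    index=df.index,
)
print(scaled.columns.tolist())  # ['pRate', 'sbmRate']
```

This only works directly for transformers that keep one output column per input column. If you stay with DataFrameMapper, the sklearn-pandas documentation also describes a df_out=True constructor option that returns a DataFrame instead of an array; check that your installed version supports it.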