Georg Heiler - 2 months ago
Python Question

sklearn function transformer in pipeline

While writing my first scikit-learn pipeline, I ran into an issue when only a subset of columns is passed into the pipeline:

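import pandas as pd

mydf = pd.DataFrame({'classLabel': [0, 0, 0, 1, 1, 0, 0, 0],
                     'categorical': [7, 8, 9, 5, 7, 5, 6, 4],
                     'numeric1': [7, 8, 9, 5, 7, 5, 6, 4],
                     'numeric2': [7, 8, 9, 5, 7, 5, 6, "N.A"]})
columnsNumber = ['numeric1']
# X (the feature DataFrame) and y (the labels) are not defined in this snippet.
XoneColumn = X[columnsNumber]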


I use the FunctionTransformer like this:

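from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def extractSpecificColumn(X, columns):
    return X[columns]

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('numeric', FunctionTransformer(columnsNumber)),
            ('scale', StandardScaler())
        ]))
    ], n_jobs=1)),
    ('estimator', RandomForestClassifier(n_estimators=50, criterion='entropy', n_jobs=-1))
])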

cv.cross_val_score(pipeline, XoneColumn, y, cv=folds, scoring=kappaScore)


This results in TypeError: 'list' object is not callable when the function transformer is enabled.

Please see https://github.com/geoHeil/pythonQuestions/blob/master/pipeline01.ipynb /question3 at the bottom for details.

edit:



If I instantiate a ColumnExtractor like the one below, no error is raised. But isn't the FunctionTransformer meant for exactly such simple cases, so that it should just work?

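from sklearn.base import TransformerMixin

class ColumnExtractor(TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def transform(self, X, *_):
        # Select the configured columns from the incoming DataFrame.
        return X[self.columns]

    def fit(self, *_):
        # Nothing to learn; return self so the step can be chained.
        return self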

Answer

FunctionTransformer is used to "lift" a plain function into a transformer, which I think can help with some data cleaning steps. Imagine you have a mostly numeric array and you want to transform it with a transformer that will error out if it gets a NaN (like Normalizer). You might end up with something like

df.fillna(0, inplace=True)
...
cross_val_score(pipeline, ...)

but maybe that fillna is only required for one transformation, so instead of calling fillna up front as above, you have

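import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

normalize = make_pipeline(
    FunctionTransformer(np.nan_to_num, validate=False),
    Normalizer()
)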

which ends up normalizing the data as you want. You can then reuse that snippet in more places without littering your code with .fillna(0).
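As a rough, self-contained sketch of the snippet above, the following runs the same two-step pipeline on a tiny made-up array. It assumes scikit-learn's Normalizer class (which rescales each row to unit L2 norm) is the normalizing transformer meant here; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer

# Toy data (made up for illustration) with one missing value.
X = np.array([[3.0, 4.0],
              [np.nan, 2.0]])

# np.nan_to_num replaces NaN with 0.0 before Normalizer sees the data.
# validate=False lets the NaN reach the wrapped function untouched.
normalize = make_pipeline(
    FunctionTransformer(np.nan_to_num, validate=False),
    Normalizer(),  # rescales each row to unit L2 norm
)

X_normalized = normalize.fit_transform(X)
print(X_normalized)
```

The NaN handling now travels with the normalizing step, so the original data does not have to be modified up front.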

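In your example, you're passing in ['numeric1'], which is a plain list and not a callable that extracts columns the way the similar-looking df[['numeric1']] does. FunctionTransformer tries to call its first argument, hence the TypeError. What you may want instead is something more like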

FunctionTransformer(operator.itemgetter(columns))

but that still won't work as-is, because with input validation enabled (validate=True) the object that is ultimately passed to the wrapped function will be an np.array and not a DataFrame, so lookup by column name fails.
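The conversion to an array comes from FunctionTransformer's input validation. As a sketch only, assuming a scikit-learn version where passing validate=False (as in the nan_to_num example above) skips that conversion, the itemgetter approach can be tried on a small made-up frame. The frame and the names `select`, `extractor`, and `selected` are hypothetical.

```python
import operator

import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Small made-up frame for illustration.
df = pd.DataFrame({'numeric1': [7, 8, 9, 5],
                   'numeric2': [1, 2, 3, 4]})
columns = ['numeric1']

# A bare list is not callable, which is what the TypeError complains about.
print(callable(columns))

# itemgetter(columns) builds a function equivalent to lambda X: X[columns].
select = operator.itemgetter(columns)
print(callable(select))

# Assumption: validate=False leaves the input as a DataFrame, so the
# label-based lookup inside the wrapped function still works.
extractor = FunctionTransformer(select, validate=False)
selected = extractor.fit_transform(df)
print(list(selected.columns))
```

Whether this is preferable to a small custom transformer such as ColumnExtractor is a matter of taste; both do nothing more than index the DataFrame.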

To operate on particular columns of a DataFrame, you may want to use a library like sklearn-pandas, which lets you define transformers per column.
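sklearn-pandas is a separate package and is not shown here. To illustrate the same per-column idea, the sketch below swaps in scikit-learn's own ColumnTransformer, which is not what this answer recommends and requires a scikit-learn release that ships sklearn.compose. The frame and the step name 'scale_numeric' are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Small made-up frame for illustration.
df = pd.DataFrame({'categorical': [7, 8, 9, 5],
                   'numeric1': [7, 8, 9, 5]})

# Scale only 'numeric1'; columns that are not listed are dropped by default.
per_column = ColumnTransformer([
    ('scale_numeric', StandardScaler(), ['numeric1']),
])

scaled = per_column.fit_transform(df)
print(scaled.shape)
```

The column selection happens inside the transformer, so the DataFrame can be handed to the pipeline unchanged.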