Georg Heiler - 1 month ago
Python Question

pandas to numpy array for sklearn pipeline

I have a transformer which calculates the percentage of the values per group. Initially I used pandas, because that is what I started with and column names are nicer to handle. However, I now need to integrate it into an sklearn pipeline.

How can I convert my Transformer to support numpy arrays from a sklearn pipeline instead of pandas data frames?
The point is that self.colname can't be used for numpy arrays, and I think the grouping needs to be performed differently.

How can I implement persistence for such a transformer? The fitted weights need to be loadable from disk in order to deploy the Transformer in a pipeline.

import pandas as pd
from sklearn.base import TransformerMixin


class PercentageTransformer(TransformerMixin):
    def __init__(self, colname, typePercentage='totalTarget', _target='TARGET', _dropOriginal=True):
        self.colname = colname
        self._target = _target
        self._dropOriginal = _dropOriginal
        self.typePercentage = typePercentage

    def fit(self, X, y, *_):
        # y must carry the name stored in self._target so it can be grouped on
        original = pd.concat([y, X], axis=1)
        # number of rows per (category, target) combination
        grouped = original.groupby([self.colname, self._target]).size()
        if self.typePercentage == 'totalTarget':
            # divide by the sum of the target column
            # (the number of positives for a 0/1 target)
            df = grouped / original[self._target].sum()
            nameCol = "pre_" + self.colname
        else:
            # divide by the number of rows in each category
            df = grouped / grouped.groupby(level=0).sum()
            nameCol = "pre2_" + self.colname
        self.nameCol = nameCol
        grouped = df.reset_index(name=nameCol)
        # keep only the rows for the positive class
        groupedOnly = grouped[grouped[self._target] == 1]
        groupedOnly = groupedOnly.drop(columns=self._target)

        self.result = groupedOnly
        return self

    def transform(self, dataF):
        mergedThing = pd.merge(dataF, self.result, on=self.colname, how='left')
        # categories not seen during fit get a percentage of 0
        mergedThing.loc[(mergedThing[self.nameCol].isnull()), self.nameCol] = 0
        if self._dropOriginal:
            mergedThing = mergedThing.drop(columns=self.colname)
        return mergedThing


It would be used in a pipeline like this:

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            # using labelencoding and all bias
            ('bias', PercentageAllTransformer(FACTOR_FIELDS, _dropOriginal=True, typePercentage='totalTarget')),
        ]))
    ], n_jobs=-1)),
    ('estimator', estimator)
])


The pipeline will be fitted with X and y, where both are data frames. I am unsure whether X.as_matrix would help.

Answer
  • Converting Things Back and Forth

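Pandas has a .to_records() method and, as you mentioned, a .as_matrix() method. The .to_records() method actually keeps your column names for you, because numpy supports named columns in the form of structured (record) arrays. Note that .as_matrix() was removed in pandas 1.0; .to_numpy() or .values is its replacement.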
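
As a rough illustration of the difference, here is a minimal sketch. The toy frame and its column names (factor, value) are made up for this example and are not taken from the question; to_numpy is used as the current stand-in for as_matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are invented for illustration.
df = pd.DataFrame({"factor": [0, 1, 0], "value": [1.0, 2.0, 3.0]})

# to_records keeps the column names as fields of a numpy record array.
records = df.to_records(index=False)
field_names = records.dtype.names      # names survive the conversion
values_by_name = records["value"]      # columns can still be addressed by name

# to_numpy returns a plain 2-D array: the names are gone, so columns
# have to be addressed by position instead.
plain = df.to_numpy()
first_column = plain[:, 0]

# A record array converts back to a DataFrame with the same column names.
round_trip = pd.DataFrame.from_records(records)
```

With a record array, a transformer could keep using self.colname as a field name; with a plain array it would need a column index instead.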

  • Persistence

Pandas has a pandas.to_pickle(obj, filename) method, which takes a pandas object and pickles it. There is a corresponding pandas.read_pickle(filename) method.
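
A minimal sketch of that round trip, assuming the fitted state is a small lookup table shaped like self.result in the question. The table contents and the file name below are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical fitted lookup table, shaped like self.result in the question.
fitted = pd.DataFrame({"factor": [0, 1], "pre_factor": [0.25, 0.5]})

# Write to a temporary directory so the example leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "percentage_table.pkl")

fitted.to_pickle(path)            # method form of pandas.to_pickle(fitted, path)
restored = pd.read_pickle(path)   # load the table back from disk
```

At deployment time the restored table could be assigned back to the transformer's result attribute before calling transform.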

Numpy has save and load functions as well: numpy.save and numpy.load.
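
For the numpy route, a sketch along the same lines, assuming the weights are held in a structured array so that the field names are saved together with the values. The field names and file name are again hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical per-group percentages stored as a structured array,
# so the field names travel with the data.
weights = np.array(
    [(0, 0.25), (1, 0.5)],
    dtype=[("factor", "i8"), ("pre_factor", "f8")],
)

path = os.path.join(tempfile.mkdtemp(), "weights.npy")

np.save(path, weights)    # writes the array in .npy format
loaded = np.load(path)    # dtype and field names are restored
```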
