NirIzr - 11 months ago
Python Question

sklearn DictVectorizer(sparse=False) with a different default value, Impute a constant

I'm building a pipeline that starts with a DictVectorizer that produces a sparse matrix. Specifying sparse=False changes the output from a scipy sparse matrix to a numpy dense matrix, which is good, but the next stages in the pipeline complain about NaN values, which are an obvious outcome of using the DictVectorizer in my case. I'd like the pipeline to treat missing dictionary values as zero rather than as not available.

Imputer doesn't help me as far as I can see, because I want to "impute" with a constant value and not a statistical value dependent on other values of the column.

Following is the code I've been using:

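import sklearn as skl
import sklearn.feature_extraction
import sklearn.feature_selection
import sklearn.neighbors
import sklearn.pipeline

vectorize = skl.feature_extraction.DictVectorizer(sparse=False)
variance = skl.feature_selection.VarianceThreshold()
knn = skl.neighbors.KNeighborsClassifier(4, weights='distance', p=1)

pipe = skl.pipeline.Pipeline([('vectorize', vectorize),
                              # here be dragons ('fillna', ),
                              ('variance', variance),
                              ('knn', knn)])
pipe.fit(dict_data, labels)  # labels: the target values, defined elsewhere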

And some mocked dictionaries:

dict_data = [{'city': 'Dubai', 'temperature': 33., 'assume_zero_when_missing': 7},
{'city': 'London', 'temperature': 12.},
{'city': 'San Fransisco', 'temperature': 18.}]

Notice that in this example, assume_zero_when_missing is missing from most dictionaries, which will lead later estimators to complain:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

While the result I'm hoping for is that NaN values will be replaced with 0.

Answer Source

You could fill the NaNs with 0's after converting your list of dictionaries to a pandas dataframe using DF.fillna as shown:

import pandas as pd

df = pd.DataFrame(dict_data)
df.fillna(0, inplace=True)
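If pandas is not otherwise needed, a similar effect can be had by filling the absent keys in the dictionaries themselves before they reach the vectorizer. This is only a sketch in plain Python; the name DEFAULTS and the choice of which keys default to zero are assumptions made for illustration.

```python
# Keys that should read as zero when a record omits them (assumed for this sketch).
DEFAULTS = {'assume_zero_when_missing': 0}

dict_data = [{'city': 'Dubai', 'temperature': 33., 'assume_zero_when_missing': 7},
             {'city': 'London', 'temperature': 12.},
             {'city': 'San Fransisco', 'temperature': 18.}]

# In a dict merge the later mapping wins, so values already present in a
# record override the defaults, and only absent keys receive 0.
filled = [{**DEFAULTS, **record} for record in dict_data]
```

The resulting list of dictionaries has the same shape as the input, so it can be handed to the vectorizer unchanged.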

To use this fill as a step inside the pipeline, you could write a custom class implementing the fit and transform methods yourself, as shown:

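import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FillingNans(BaseEstimator, TransformerMixin):
    """Custom transformer for assembling into the pipeline object."""

    def transform(self, X):
        # DictVectorizer(sparse=False) returns a NumPy array, which has no
        # fillna method, so wrap it in a DataFrame before filling.
        nans_replaced = pd.DataFrame(X).fillna(0).values
        return nans_replaced

    def fit(self, X, y=None):
        # Nothing to learn: the fill value is a constant.
        return self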
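For a stateless fill like this one, a class is not strictly required: scikit-learn's FunctionTransformer can wrap a plain function. The sketch below assumes the step receives a dense numeric array; the helper name nan_to_zero is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def nan_to_zero(X):
    """Return a copy of X with every NaN replaced by 0 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), 0.0, X)


# validate=False keeps scikit-learn from checking the input before the
# function runs, so arrays that contain NaN are passed through to it.
fill_nans = FunctionTransformer(nan_to_zero, validate=False)

X = np.array([[7.0, 33.0],
              [np.nan, 12.0]])
X_filled = fill_nans.fit_transform(X)
```

The fill_nans object exposes fit and transform, so it can be used as a pipeline step in the same way as a custom class.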

Then, you could modify the steps of the pipeline as shown:

pipe = skl.pipeline.Pipeline([('vectorize', vectorize),
                             ('fill_nans', FillingNans()),
                             ('variance', variance),
                             ('knn', knn)])
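As an alternative, assuming scikit-learn 0.20 or later is available, SimpleImputer accepts strategy='constant' together with a fill_value, which covers the "impute a constant" case without a custom class. The sketch below uses mock data with explicit NaN entries (an assumption, chosen to exercise the imputation step) and leaves out the later pipeline stages.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Mock records with explicit NaN values (assumed data for this sketch).
dict_data = [
    {'temperature': 33.0, 'assume_zero_when_missing': 7.0},
    {'temperature': 12.0, 'assume_zero_when_missing': float('nan')},
    {'temperature': 18.0, 'assume_zero_when_missing': float('nan')},
]

prep = Pipeline([
    ('vectorize', DictVectorizer(sparse=False)),
    # strategy='constant' replaces every NaN with fill_value instead of a
    # statistic computed from the column.
    ('fill_nans', SimpleImputer(strategy='constant', fill_value=0)),
])

filled = prep.fit_transform(dict_data)
```

Further steps such as the variance filter and the classifier can be appended after 'fill_nans' in the same list.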