tkja tkja - 1 year ago 86
Python Question

scikit-learn FeatureUnion gridsearch over subsets of features

How can I use a FeatureUnion in scikit-learn so that a grid search can treat its parts as optional?

The code below works and sets up a FeatureUnion of a TfidfVectorizer for words and a CountVectorizer for chars.

When doing a grid search, in addition to testing the defined parameter space, I would also like to test only 'vect__wordvect' with its ngram_range (with the char vectorizer disabled), and also only 'vect__lettervect' with its lowercase parameter (with the word vectorizer disabled).

How can this be done?

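from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# set up the feature union: word-level tf-idf plus char-level counts
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])

# set up the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = Pipeline([('vect', featureunionvect), ('classifier', classifier)])

# grid search over parameters of both vectorizers
parameters = {'vect__wordvect__ngram_range': [(1, 1), (1, 2)],
              'vect__lettervect__lowercase': [True, False]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1)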


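FeatureUnion has a parameter called transformer_list that you can grid-search over. Pass GridSearchCV a list of grids, one per subset of transformers, so that each grid only sets parameters of transformers that are actually in the union (setting 'vect__lettervect__lowercase' while lettervect is missing from transformer_list raises an invalid-parameter error). In your case the grid search parameters would become

parameters = [
    {'vect__wordvect__ngram_range': [(1, 1), (1, 2)],
     'vect__lettervect__lowercase': [True, False]},
    {'vect__transformer_list': [[('wordvect', wordvect)]],
     'vect__wordvect__ngram_range': [(1, 1), (1, 2)]},
    {'vect__transformer_list': [[('lettervect', lettervect)]],
     'vect__lettervect__lowercase': [True, False]},
]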
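
As a possible alternative to swapping out transformer_list, a reasonably recent scikit-learn lets you disable a single FeatureUnion member by setting it to the string 'drop'. The sketch below is a minimal end-to-end version under that assumption. The toy corpus, the labels, and cv=2 are invented for illustration and are not from the question. It uses one grid per feature subset, so each grid only names parameters of transformers that are still active.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data, invented purely for illustration.
docs = ["good movie", "great film", "nice plot", "fine acting",
        "bad movie", "awful film", "poor plot", "dull acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", FeatureUnion([
        ("lettervect", CountVectorizer(analyzer="char")),
        ("wordvect", TfidfVectorizer(analyzer="word")),
    ])),
    ("classifier", LogisticRegression(class_weight="balanced")),
])

word_grid = {"vect__wordvect__ngram_range": [(1, 1), (1, 2)]}
letter_grid = {"vect__lettervect__lowercase": [True, False]}

# One grid per feature subset. A dropped transformer has no tunable
# parameters, so its grid entries are left out of that subset.
parameters = [
    {**word_grid, **letter_grid},                 # both vectorizers
    {**word_grid, "vect__lettervect": ["drop"]},  # words only
    {**letter_grid, "vect__wordvect": ["drop"]},  # chars only
]

gs_clf = GridSearchCV(pipeline, parameters, cv=2)
gs_clf.fit(docs, labels)

# 4 candidates with both vectorizers, 2 words-only, 2 chars-only.
print(len(gs_clf.cv_results_["params"]))
print(gs_clf.best_params_)
```

If 'vect__lettervect' or 'vect__wordvect' shows up as 'drop' in best_params_, the corresponding subset scored best on this data; with real data you would keep your own cv and n_jobs settings.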