CV on Params in DataFrameMapper Transforms? #124

andrewm4894 · 2017-09-06T10:44:05Z

Apologies for posting this as an issue, but I feel it could be a useful use case.

I'm just wondering if something like what I'm trying to do is, or should be, possible.

If I set up a pipeline like:

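from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper

# make a pipeline for each individual text variable
name_to_tfidf = Pipeline([('name_vect', CountVectorizer()), ('name_tfidf', TfidfTransformer())])
ticket_to_tfidf = Pipeline([('ticket_vect', CountVectorizer()), ('ticket_tfidf', TfidfTransformer())])

# map each DataFrame column to its transformer
full_mapper = DataFrameMapper([
    ('Name', name_to_tfidf),
    ('Ticket', ticket_to_tfidf),
    ('Sex', LabelBinarizer())
])

# build full pipeline: mapper first, then the classifier
full_pipeline = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True))
])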

Is there a way to pass a list of options to cross-validate over for individual transforms in the DataFrameMapper, like this:

# determine full param search space (need to get the params for the mapper parts in here somehow)
full_params = {'clf__alpha': [1e-2,1e-3,1e-4],
               'clf__loss':['modified_huber','hinge'],
               'clf__penalty':['l2','l1'],
               # now set the params for the datamapper part of the pipeline
               'mapper__features':[[
                   ('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),
                   ('Ticket',deepcopy(ticket_to_tfidf).set_params(ticket_vect__analyzer = ['char', 'char_wb']))
               ]]
              }

Ideally I'd like to cross-validate over which params are best for the name_to_tfidf and ticket_to_tfidf pipelines in the DataFrameMapper.

But passing a list of options to set_params() like this gives me this error when I go to fit:

ValueError: ['char', 'char_wb'] is not a valid tokenization scheme/analyzer


scotthuang1989 · 2017-09-08T07:55:10Z

I think what you want is GridSearchCV: just create a GridSearchCV, pass it to the pipeline as a "normal" estimator, and then you will get what you want.

andrewm4894 · 2017-09-08T08:57:35Z

My bad - I left that part out. I am doing this:

# set up grid search
gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

And then:

# do the fit
gs_clf.fit(df,df['Survived'])

So I am able to do the CV on the clf params, but I'd also like to do CV on some params within the transforms in the DataFrameMapper - I'm just not sure how to go about this.

Here is a full example notebook.

Basically I was passing ['char', 'char_wb'] to this line, for example:
('Name',deepcopy(name_to_tfidf).set_params(name_vect__analyzer = ['char', 'char_wb'])),

I was hoping that GridSearchCV would then also consider those two params in the grid.
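One possible workaround, sketched here under assumptions rather than confirmed in this thread: set_params() stores the list as the analyzer value itself, so the grid never sees the two options. Since the question already searches over 'mapper__features', each grid candidate could instead be a complete feature list whose pipelines are built with one fixed analyzer. The helpers analyzer_combinations, make_text_pipeline and make_feature_candidates are hypothetical names introduced for illustration, and this assumes DataFrameMapper accepts a new features list through set_params, as the 'mapper__features' key in the question implies.

```python
from itertools import product

# Analyzer options to search over, taken from the question.
ANALYZERS = ['char', 'char_wb']
TEXT_COLUMNS = ['Name', 'Ticket']


def analyzer_combinations(columns, analyzers):
    """Return one {column: analyzer} dict per combination of options."""
    return [dict(zip(columns, combo))
            for combo in product(analyzers, repeat=len(columns))]


def make_text_pipeline(prefix, analyzer):
    """Hypothetical helper: vectorizer + tf-idf pipeline with a fixed analyzer."""
    # Imported lazily so the combination logic above runs without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    return Pipeline([
        (prefix + '_vect', CountVectorizer(analyzer=analyzer)),
        (prefix + '_tfidf', TfidfTransformer()),
    ])


def make_feature_candidates(combos):
    """Hypothetical helper: one complete feature list per analyzer combination."""
    from sklearn.preprocessing import LabelBinarizer
    return [
        [
            ('Name', make_text_pipeline('name', combo['Name'])),
            ('Ticket', make_text_pipeline('ticket', combo['Ticket'])),
            ('Sex', LabelBinarizer()),
        ]
        for combo in combos
    ]


combos = analyzer_combinations(TEXT_COLUMNS, ANALYZERS)
# Assumed usage: full_params['mapper__features'] = make_feature_candidates(combos)
```

With two analyzers and two text columns this yields four candidate feature lists, so the grid grows by a factor of four on top of the clf params. Whether the mapper picks up each candidate correctly depends on the sklearn-pandas version in use, so it is worth checking gs_clf.best_params_ after fitting.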
