Expose parameters from transformers as parameters of the mapper #159

Open

gwerbin opened this issue Jul 26, 2018 · 8 comments

Currently, it can be hard to use a "parametric" transformer in a DataFrameMapper because the parameters of the underlying transformers aren't exposed through get_params(). This means you can't tune the parameters of one of those transformers with GridSearchCV or RandomizedSearchCV.

Example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper

pipeline = Pipeline([
    ('vectorizer',
        DataFrameMapper([
            ('document_contents', CountVectorizer())
        ], df_out=False)),
    ('classifier', MultinomialNB())
])

pipeline.get_params()

These are the params I get:

{'memory': None,
 'steps': [('vectorizer', DataFrameMapper(default=False, df_out=False,
           features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
           input_df=False, sparse=False)),
  ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
 'vectorizer': DataFrameMapper(default=False, df_out=False,
         features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None))],
         input_df=False, sparse=False),
 'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 'vectorizer__default': False,
 'vectorizer__df_out': False,
 'vectorizer__features': [('document_contents',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
 'vectorizer__input_df': False,
 'vectorizer__sparse': False,
 'classifier__alpha': 1.0,
 'classifier__class_prior': None,
 'classifier__fit_prior': True}

Naively, I would expect something like this:

{'memory': None,
 'steps': [('vectorizer', DataFrameMapper(default=False, df_out=False,
           features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None))],
           input_df=False, sparse=False)),
  ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
 'vectorizer': DataFrameMapper(default=False, df_out=False,
         features=[('document_contents', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None))],
         input_df=False, sparse=False),
 'classifier': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
 'vectorizer__document_contents__analyzer': 'word',
 'vectorizer__document_contents__binary': False,
 'vectorizer__document_contents__decode_error': 'strict',
 'vectorizer__document_contents__dtype': numpy.int64,
 'vectorizer__document_contents__encoding': 'utf-8',
 'vectorizer__document_contents__input': 'content',
 'vectorizer__document_contents__lowercase': True,
 'vectorizer__document_contents__max_df': 1.0,
 'vectorizer__document_contents__max_features': None,
 'vectorizer__document_contents__min_df': 1,
 'vectorizer__document_contents__ngram_range': (1, 1),
 'vectorizer__document_contents__preprocessor': None,
 'vectorizer__document_contents__stop_words': None,
 'vectorizer__document_contents__strip_accents': None,
 'vectorizer__document_contents__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vectorizer__document_contents__tokenizer': None,
 'vectorizer__document_contents__vocabulary': None,
 'vectorizer__default': False,
 'vectorizer__df_out': False,
 'vectorizer__input_df': False,
 'vectorizer__sparse': False,
 'classifier__alpha': 1.0,
 'classifier__class_prior': None,
 'classifier__fit_prior': True}

which would be very handy for, say, using GridSearchCV to compare word and character analyzers.
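
For instance, with the proposed naming a search over analyzers could look like the sketch below (hypothetical: it assumes the feature exists, and train_df / train_labels are placeholder names):

from sklearn.model_selection import GridSearchCV

# Hypothetical parameter names, assuming the mapper exposes them:
param_grid = {
    'vectorizer__document_contents__analyzer': ['word', 'char'],
    'vectorizer__document_contents__ngram_range': [(1, 1), (1, 3)],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(train_df, train_labels)  # placeholder training data
print(search.best_params_)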

This seems like it shouldn't be too hard to implement. If there's interest, I can spend some time digging around the codebase to work on it.

devforfu (Collaborator) commented:

@gwerbin That would be helpful for making the transformers inside the mapper "grid-searchable". We just need to make sure these "deep" parameters are properly assigned to the nested objects by the set_params method.
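
Concretely, GridSearchCV applies candidate values through set_params, so the mapper would have to route calls like this to the right nested transformer (a hypothetical sketch assuming the proposed naming scheme):

mapper = pipeline.named_steps['vectorizer']
# Hypothetical: the mapper would forward this parameter to the
# CountVectorizer attached to the 'document_contents' column.
mapper.set_params(document_contents__analyzer='char')
assert mapper.features[0][1].analyzer == 'char'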

gwerbin (Author) commented Jul 27, 2018

@devforfu One thing I just thought of is how to handle mappers like this:

pipeline = Pipeline([
    ('vectorizer',
        DataFrameMapper([
            ('document_contents', [TextCleaner(), CountVectorizer()])
        ], df_out=False)),
    ('classifier', MultinomialNB())
])

(I made up the TextCleaner class just for illustration.)

What would the step names be in this case? Maybe something like:

'vectorizer__document_contents__0__text_cleaning_method': 'default',
'vectorizer__document_contents__1__analyzer': 'word',
'vectorizer__document_contents__1__binary': True,

devforfu (Collaborator) commented Aug 1, 2018

@gwerbin I guess it could be the class name as well. As far as I recall, make_pipeline() from scikit-learn names pipeline steps using the lowercased class names of the estimators it is made of. So in this case it could be something similar:

'vectorizer__document_contents__textcleaner__text_cleaning_method': 'default',
'vectorizer__document_contents__countvectorizer__analyzer': 'word',
'vectorizer__document_contents__countvectorizer__binary': True,
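
For reference, that is indeed what make_pipeline does: step names are the lowercased class names of the estimators, e.g.:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), CountVectorizer())
print([name for name, _ in pipe.steps])
# ['standardscaler', 'countvectorizer']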

dukebody (Collaborator) commented Aug 5, 2018

@gwerbin Thanks for your contribution! It would certainly be a very interesting feature, since currently it is impossible to tune the internal parameters of the DataFrameMapper within a pipeline in any parameter search.

Would you be willing to implement such a feature? Ideally the interface should be as similar to sklearn's as possible, so that it stays compatible with sklearn's grid and randomized searches.

gwerbin (Author) commented Aug 6, 2018

@dukebody I'm willing, but can't make any guarantees on a timeline. I've been pretty busy lately and don't want to commit to anything I can't deliver.

I would also need time to familiarize myself with how parameters are passed in the current code.

If anyone else wants to pick this up, I won't be offended.

devforfu (Collaborator) commented Aug 7, 2018

@gwerbin If nobody else has started working on this, I can take a stab at a basic solution. Of course, we can join efforts as soon as you're more available.

devforfu self-assigned this Aug 14, 2018
devforfu (Collaborator) commented:

Ok, I've started working on the proposed feature in my fork. There are a couple of new tests as well.

Probably some of the code required to implement get_params and set_params could be borrowed from scikit-learn instead of writing a custom solution, but I've decided to start with something straightforward first. I also think we can use TransformerPipeline for cases where only one transformer is defined per column, because it exactly matches the ('step_name', instance) format expected by the pipeline.
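
The straightforward version of get_params is roughly a flattening of each transformer's own parameters under the column name (a minimal sketch of the idea, not the actual code from the fork; it only handles the simple ('column', transformer) case):

def get_transformer_params(features, deep=True):
    # Expose each transformer's parameters under a
    # '<column>__<param>' prefix, mirroring how sklearn's
    # Pipeline.get_params() names nested parameters.
    params = {}
    for column, transformer in features:
        if hasattr(transformer, 'get_params'):
            for key, value in transformer.get_params(deep=deep).items():
                params['%s__%s' % (column, key)] = value
    return params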

I'm testing the getters/setters now; next I'll check that the methods work with GridSearchCV. As soon as the basic version is ready, I'll open a PR for review and improvements.

dukebody (Collaborator) commented:

Marking as "good first issue" to review the PR you created @devforfu
