
Add DataFrameMapper.get_feature_names (wrapper for transformed_features_) #109

molaxx opened this issue Jul 4, 2017 · 13 comments

molaxx commented Jul 4, 2017

Since get_feature_names() is the de facto standard in sklearn (implemented in FeatureUnion, CountVectorizer, PolynomialFeatures, DictVectorizer), adding it would reduce friction when using DataFrameMapper.
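For reference, the convention being asked for looks roughly like this. ToyBinarizer is a hypothetical transformer written for illustration, not part of sklearn:

```python
class ToyBinarizer:
    """Toy transformer following the de facto sklearn naming convention:
    fitted state lives in trailing-underscore attributes, and
    get_feature_names() works right after fit."""

    def fit(self, X, y=None):
        # Learn the vocabulary from the training data.
        self.classes_ = sorted({value for row in X for value in row})
        return self

    def transform(self, X):
        # Emit one indicator column per learned class.
        return [[int(c in row) for c in self.classes_] for row in X]

    def get_feature_names(self):
        # Available immediately after fit, before any transform call.
        return list(self.classes_)
```

With this convention, `ToyBinarizer().fit(data).get_feature_names()` returns the output column names without ever calling transform, which is the behavior being requested for DataFrameMapper.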

@arnau126 (Collaborator)

I think it's a good idea.

The only consideration is that transformed_features_ is filled when DataFrameMapper.transform is called, and it can change after every subsequent transform call.

In contrast, sklearn's get_feature_names() only requires the transformer to be fitted, and the feature names don't change after a transform call.
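A minimal sketch of this timing difference (a hypothetical stand-in class, not the actual DataFrameMapper code):

```python
class MapperNamesSketch:
    """Hypothetical sketch: feature names exist only after transform,
    and are rebuilt on every transform call."""

    def fit(self, X):
        # fit alone does not compute transformed feature names
        return self

    def transform(self, X):
        # transformed_features_ is (re)built on each transform call,
        # so its value depends on the last data transformed
        self.transformed_features_ = ["col_%d" % i for i in range(len(X[0]))]
        return X

    def get_feature_names(self):
        if not hasattr(self, "transformed_features_"):
            raise AttributeError(
                "feature names are only available after transform() has run"
            )
        return self.transformed_features_
```

This is exactly the mismatch with the sklearn convention: a fitted-but-never-transformed mapper has no names to return.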

@dukebody (Collaborator)

@molaxx thanks for the suggestion, I also think it's a good idea, with the caveat that @arnau126 mentions. Can you submit a PR with the implementation and some modifications to the README to indicate the availability of the method? Thanks.

molaxx (Author) commented Aug 6, 2017

@arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost.
I think this logic should move to DataFrameMapper.fit. What do you think?

arnau126 (Collaborator) commented Aug 10, 2017

Currently it's not possible to move this logic to fit, because get_names (the function used to build transformed_features_) needs the transformed columns.

And it needs them because:
- not all transformers have get_feature_names() or classes_.
- sometimes classes_ doesn't contain the feature names.

https://github.com/pandas-dev/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L241
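In other words, the name-building fallback chain looks roughly like this (a simplified paraphrase of the linked get_names logic; the helper name derive_names is hypothetical):

```python
def derive_names(column, transformer, transformed):
    """Best-effort output feature names for one mapped column (sketch)."""
    # 1. Preferred: the transformer announces its own names.
    if hasattr(transformer, "get_feature_names"):
        return ["%s_%s" % (column, n) for n in transformer.get_feature_names()]
    # 2. Fallback: label-style transformers expose classes_.
    if hasattr(transformer, "classes_"):
        return ["%s_%s" % (column, c) for c in transformer.classes_]
    # 3. Last resort: count the output columns -- this is the step that
    #    requires the transformed data, which is why the names cannot be
    #    built during fit for arbitrary transformers.
    first = transformed[0] if transformed else None
    width = len(first) if hasattr(first, "__len__") else 1
    return [column] if width == 1 else ["%s_%d" % (column, i) for i in range(width)]
```

Only branch 3 forces a transform call; branches 1 and 2 would work at fit time, but not every transformer supports them.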

molaxx (Author) commented Aug 13, 2017

Couldn't this be handled by transforming a single row after fitting? It's a bit hacky, but not having feature names after a fit is surprising. I like boring APIs :)
Plus, I'll probably end up doing this manually anyway so the feature names are available right after unpickling the model.
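The manual workaround described above can be sketched like this. StandInMapper is a hypothetical stand-in mimicking the discussed behavior (names only populated inside transform), and fit_with_names is an invented helper:

```python
class StandInMapper:
    """Stand-in for DataFrameMapper: feature names appear only after transform."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        # Mirrors the behavior discussed above: names are built per transform call.
        self.transformed_features_ = ["f%d" % i for i in range(len(rows[0]))]
        return rows


def fit_with_names(mapper, rows):
    """Workaround sketch: transform one row right after fitting so the
    feature names survive pickling (hypothetical helper)."""
    mapper.fit(rows)
    mapper.transform(rows[:1])  # throwaway transform, just to populate the names
    return mapper
```

After fit_with_names, the mapper can be pickled with its training-time feature names intact.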

@dukebody (Collaborator)

@molaxx I don't like the idea of transforming just one row to get the feature names; it's too hacky. I understand it can be surprising that one needs to transform the data to get the column names, but this is due to the complex nature of the custom transformers.

What we can do is try to get these from the last transformer for each column during fit, like FeatureUnion does, and fail if they cannot be extracted, with a message indicating that one has to transform first to get inferred column names in those cases.

Are you up for submitting a PR for such a feature?
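That proposal might be sketched as follows (a hypothetical function illustrating the FeatureUnion-style approach, not actual sklearn-pandas code):

```python
def names_at_fit_time(column, transformer):
    """Try to derive feature names from a fitted transformer without
    transforming any data; fail loudly when that is impossible.

    Hypothetical sketch of the fit-time extraction proposed above."""
    if hasattr(transformer, "get_feature_names"):
        return ["%s_%s" % (column, n) for n in transformer.get_feature_names()]
    if hasattr(transformer, "classes_"):
        return ["%s_%s" % (column, c) for c in transformer.classes_]
    raise TypeError(
        "Cannot extract feature names for column %r at fit time; "
        "transform first to get inferred column names." % column
    )
```

Transformers covered by the first two branches get names at fit time; everything else produces an explicit error instead of silently missing names.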

molaxx (Author) commented Aug 20, 2017 via email

dukebody (Collaborator) commented Sep 3, 2017

I believe that any transformer that doesn't have a classes_ attribute or a get_feature_names method (the ones that _get_feature_names leverages) will work for the test. You can create a mock one in the tests.

zouzias commented Feb 5, 2018

@molaxx OneHotEncoder is an example of a transformer without classes_ or get_feature_names(). See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

@JohnPaton

> @arnau126 this behavior seems like a bug. Why would we want the features to change after each transform? Furthermore, after fitting and pickling a model, the feature set used for training is lost.
> I think this logic should move to DataFrameMapper.fit. What do you think?

Just popping in to say that I just spent ages trying to debug different numbers of columns in my training and test sets, because it turned out the test set had an extra label in a column that was being one-hot encoded. It would have been much easier to have some exception along the lines of "Number of columns from feature <feature> after transformation does not match last call".
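Such a guard might look like this (a hypothetical check written for illustration, not part of sklearn-pandas):

```python
def check_column_count(feature, fitted_width, transformed):
    """Hypothetical guard: fail loudly when a feature's transformed width
    changes between calls (e.g. an unseen label in a one-hot encoded
    column of the test set)."""
    width = len(transformed[0])
    if width != fitted_width:
        raise ValueError(
            "Number of columns from feature %r after transformation (%d) "
            "does not match the last call (%d)" % (feature, width, fitted_width)
        )
    return transformed
```

Raising here turns a confusing downstream shape mismatch into an immediate, named error.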

iDmple commented Oct 18, 2019

What's the status on this?

Is there any known workaround?

ragrawal (Collaborator) commented May 8, 2021

@JohnPaton, @iDmple can you provide a simple example that I can use to build and test a solution?

@JohnPaton

Sorry, that was 2 years ago, I don't have it lying around now 😕
