Add an option to DataFrameMapper to add missing columns #111

Open
gsmafra opened this issue Jul 7, 2017 · 4 comments


@gsmafra
Contributor

gsmafra commented Jul 7, 2017

I am currently working on a workflow where we convert database records directly to a pandas DataFrame and then apply ML algorithms to it with the help of sklearn-pandas. However, sometimes these records don't have all the features used for prediction, so I have to add the missing columns to the DataFrame. For that I wrote a custom transformer that is applied before the DataFrameMapper:

from sklearn.base import BaseEstimator, TransformerMixin


class ColumnInserter(BaseEstimator, TransformerMixin):
    """Add columns seen during fit that are missing at transform time."""

    def fit(self, df, y=None):
        # remember the column names (and their order) seen during fit
        self.columns_ = list(df.columns)
        return self

    def transform(self, df):
        df_new = df.copy()

        # insert missing columns, filled with None, in the order seen at fit
        for col in self.columns_:
            if col not in df_new.columns:
                df_new[col] = None

        return df_new

Maybe it would also be useful to others to have this kind of feature in sklearn-pandas itself, probably using the columns specified in the features parameter.
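
For reference, a minimal sketch of the same idea using plain pandas, assuming the expected feature names are known up front (the `expected_features` list below is made up for illustration). `DataFrame.reindex` adds any missing column filled with NaN and, unlike the transformer above, it also drops columns that are not in the list:

```python
import pandas as pd

# hypothetical feature list, e.g. the column names given to the mapper
expected_features = ["age", "income", "city"]

# a batch of records that lacks "city" and carries one extra column
df = pd.DataFrame({"age": [31, 45], "income": [1200.0, 3400.0], "extra": [1, 2]})

# keep the listed columns in the listed order; missing ones are added as NaN
aligned = df.reindex(columns=expected_features)
```

The original `df` is left untouched, so this can be applied to each incoming batch before prediction.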

@arnau126
Collaborator

I might add an option to the DataFrameMapper.__init__ called missing_features.

This parameter would have 2 options:

  • 'raise' (default). Raise an error if some feature is missing (current behaviour).
  • 'add'. Fill the missing feature with None or NaN and pass it to the transformers.

What do you think?
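
A rough sketch of how the two proposed values could behave, written as a standalone helper rather than as actual DataFrameMapper code. The function name `handle_missing_features` and its signature are hypothetical; only the `missing_features` parameter name and its 'raise' / 'add' values come from the proposal above:

```python
import numpy as np
import pandas as pd


def handle_missing_features(df, features, missing_features="raise"):
    """Hypothetical helper illustrating the proposed 'raise' / 'add' options."""
    if missing_features not in ("raise", "add"):
        raise ValueError("missing_features must be 'raise' or 'add'")

    missing = [col for col in features if col not in df.columns]
    if not missing:
        return df

    if missing_features == "raise":
        # fail loudly when a requested feature is absent
        raise KeyError("missing features: {}".format(missing))

    # 'add': return a copy with the absent features filled with NaN
    df_new = df.copy()
    for col in missing:
        df_new[col] = np.nan
    return df_new


df = pd.DataFrame({"a": [1, 2]})
filled = handle_missing_features(df, ["a", "b"], missing_features="add")
```

Keeping a string option instead of a boolean leaves room for other strategies later, at the cost of a slightly longer call.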

@gsmafra
Contributor Author

gsmafra commented Jul 12, 2017

@arnau126 I can't think of any other options we might want in the future, so we might as well make it a boolean, couldn't we? The most intuitive name would probably be insert_missing_features or add_missing_features, though I don't know if that looks too long.

@dukebody
Collaborator

I believe this functionality, if implemented, would be better as a component outside of the DataFrameMapper, to avoid overloading this class with overly complex custom behaviour. It's already quite complex, with lots of options.

I see it more as a kind of "column imputer" transformer. I'm good with adding this transformer as part of the package if @arnau126 agrees as well. Then we would need a PR with some extra documentation advertising this feature.

Thanks @gsmafra !
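
To make the "column imputer" idea concrete, here is a minimal sketch of a standalone transformer placed in front of another pipeline step. The class name `MissingColumnImputer` and its `columns` / `fill_value` parameters are made up for illustration and are not part of sklearn-pandas; a `FunctionTransformer` stands in for the DataFrameMapper step so the example only needs scikit-learn and pandas:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class MissingColumnImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: make sure the given columns exist."""

    def __init__(self, columns, fill_value=np.nan):
        self.columns = columns
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # nothing to learn: the expected columns are given explicitly
        return self

    def transform(self, X):
        X_new = X.copy()
        for col in self.columns:
            if col not in X_new.columns:
                X_new[col] = self.fill_value
        return X_new


pipeline = Pipeline([
    ("add_columns", MissingColumnImputer(columns=["a", "b"], fill_value=0.0)),
    # placeholder for the step that would normally be a DataFrameMapper
    ("select", FunctionTransformer(lambda df: df[["a", "b"]].to_numpy(), validate=False)),
])

result = pipeline.fit_transform(pd.DataFrame({"a": [1.0, 2.0]}))
```

Because the imputer is its own step, the mapper's configuration stays unchanged and the fill strategy can be swapped independently.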

@datajanko

datajanko commented Feb 2, 2018

I think you can incorporate this directly into a DataFrameMapper (since you can select columns multiple times). Otherwise you might want to use a FeatureUnion (a short implementation for data frames can be found here).
