Add an option to DataFrameMapper to add missing columns #111

Open
@gsmafra

Description

I am currently working on a workflow where we convert database records directly to a pandas DataFrame and then apply ML algorithms to it with the help of sklearn-pandas. However, sometimes these records don't have all the features used for prediction, and I have to add the missing columns to the DataFrame. For that I wrote a custom transformer to be applied before DataFrameMapper:

from sklearn.base import BaseEstimator, TransformerMixin


class ColumnInserter(BaseEstimator, TransformerMixin):
    """Add columns seen during fit that are missing at transform time."""

    def __init__(self):
        self.columns = []

    def fit(self, df, y=None):
        # remember the columns present in the training DataFrame
        self.columns = list(df.columns)
        return self

    def transform(self, df):
        df_new = df.copy()

        # insert missing columns, in the order seen during fit
        missing_cols = [col for col in self.columns if col not in df.columns]
        for col in missing_cols:
            df_new[col] = None

        return df_new

It might also be useful to others to have this kind of feature in sklearn-pandas itself, probably using the columns specified in the features parameter.
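To make the idea concrete, here is a rough sketch of how the expected columns could be derived from a features list instead of from a fitted DataFrame. It assumes each entry is a tuple whose first element is either a column name or a list of column names, as in the lists passed to DataFrameMapper. The helper names `columns_from_features` and `add_missing_columns`, and the `fill_value` argument, are hypothetical and not part of sklearn-pandas.

```python
import pandas as pd


def columns_from_features(features):
    """Collect column names from a DataFrameMapper-style features list.

    Assumption: each entry is a tuple whose first element is a column
    name (str) or a list of column names.
    """
    columns = []
    for entry in features:
        selector = entry[0]
        names = [selector] if isinstance(selector, str) else list(selector)
        for name in names:
            # keep first-seen order and skip duplicates
            if name not in columns:
                columns.append(name)
    return columns


def add_missing_columns(df, features, fill_value=None):
    """Return a copy of df that contains every column named in features."""
    df_new = df.copy()
    for col in columns_from_features(features):
        if col not in df_new.columns:
            df_new[col] = fill_value
    return df_new


# Hypothetical features list; the transformers are left as None here.
features = [("age", None), (["height", "weight"], None)]
records = pd.DataFrame({"age": [31, 45], "height": [1.70, 1.82]})

completed = add_missing_columns(records, features)
print(list(completed.columns))
```

Unlike the transformer above, this version needs no fit step, because the column list comes from the mapper definition itself. The input DataFrame is left unchanged and extra columns are kept.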
