Bug when using sklearn's make_column_selector & default=None

Hi,

When calling `DataFrameMapper` with

1.  `default=None`, and
2.  the feature selection specified using sklearn's `make_column_selector`,

the output of `transform()` is incorrect: it passes through ALL columns, not only the ones left untouched by the transformation, as intended.


To debug this, I monkeypatched sklearn-pandas to insert some prints into `_unselected_columns`:

```python
    def _unselected_columns(self, X):
        """
        Return list of columns present in X and not selected explicitly in the
        mapper.

        Unselected columns are returned in the order they appear in the
        dataframe to avoid issues with different ordering during default fit
        and transform steps.
        """
        X_columns = list(X.columns)
        
        unselected = [column for column in X_columns if
                column not in self._selected_columns
                and column not in self.drop_cols]
        
        print(f"Selected: {list(self._selected_columns)}")
        print(f"Unselected: {unselected}")

        return unselected
```

If the columns for the feature are passed as a list of strings (by resolving the column_selector directly), everything works as it should:


```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper


# Pass-through transformer for debugging (definition reconstructed from its description).
class DebugTransformer(BaseEstimator, TransformerMixin):
    """Does nothing: fit() returns self, transform() returns the input X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Resolve the selector up front so the mapper receives a plain list of column names.
categorical_features = make_column_selector(dtype_include=dtype_selection)
categorical_features = categorical_features(X)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

mapper = DataFrameMapper(
    [(categorical_features, DebugTransformer())],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print(f"Output columns: {list(out.columns)}")

# Output:
# Selector: categorical_features=['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest'], type(categorical_features)=<class 'list'>
# Selected: ['embarked', 'boat', 'home.dest', 'sex', 'ticket', 'name', 'cabin']
# Unselected: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
# Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
```

However, if the column_selector itself is passed in the features tuple, the output is incorrect. The selected columns are duplicated in the final output:

```python
from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]

# Pass the selector object itself instead of the resolved list of column names.
categorical_features = make_column_selector(dtype_include=dtype_selection)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")

mapper = DataFrameMapper(
    [(
        categorical_features,
        DebugTransformer(),  # Does nothing, returns self in fit(), returns input X in transform()
    )],
    df_out=True,
    input_df=True,
    default=None,
)

mapper.fit(X, y)
out = mapper.transform(X)
print(f"Output columns: {list(out.columns)}")

# Output:
# Selector: categorical_features=<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6CB47F0>, type(categorical_features)=<class 'sklearn.compose._column_transformer.make_column_selector'>
# Selected: [<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6B0F7C0>]
# Unselected: ['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
# Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
```
This happens because the selected columns are not filtered out properly in the `_unselected_columns` method, in this part:

```python
unselected = [column for column in X_columns if
                column not in self._selected_columns   # <-- the set holds the selector object, not column names, so 'not in' is always True
                and column not in self.drop_cols]
```
The selector object itself is added to the set as if it were a column name, in the `_selected_columns` property at the line `selected_columns.add(columns)`.
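
The membership check can be reproduced without sklearn-pandas. The snippet below is only an illustration: `FakeSelector` is a hypothetical stand-in for the object returned by `make_column_selector` (the real one is called with a DataFrame, not a list of names), and the column list is shortened.

```python
# Hypothetical stand-in for a column selector: a callable object, not a column name.
class FakeSelector:
    def __call__(self, columns):
        return [c for c in columns if c in ("name", "sex")]


x_columns = ["pclass", "name", "sex", "age"]
selector = FakeSelector()

# What the mapper effectively stores: the selector object, not the names it selects.
selected_columns = {selector}

# No column name ever equals the selector object, so nothing is filtered out.
unselected = [c for c in x_columns if c not in selected_columns]
print(unselected)  # ['pclass', 'name', 'sex', 'age']
```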


To solve this, add handling for sklearn's column_selector either within `_unselected_columns` or in the `_selected_columns` property.
A temporary workaround for others who encounter this: call the column_selector directly on your input dataframe, then pass the resulting list of column names to the `features` parameter.
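
One possible shape for such handling, as a rough sketch rather than a patch against the actual sklearn-pandas source: resolve any callable selector against the input before building the set of selected names. The helper names `resolve_selected_columns` and `unselected_columns` are hypothetical, and a plain list of column names stands in for the DataFrame that a real selector would receive.

```python
def resolve_selected_columns(features, X):
    """Collect selected column names, calling any callable selector on X first."""
    selected = set()
    for feature in features:
        columns = feature[0]
        if callable(columns):
            # e.g. a make_column_selector object: resolve it to concrete names
            columns = columns(X)
        if isinstance(columns, (list, tuple)):
            selected.update(columns)
        else:
            selected.add(columns)
    return selected


def unselected_columns(x_columns, selected, drop_cols=()):
    """Columns of X that are neither selected nor dropped, in their original order."""
    return [c for c in x_columns if c not in selected and c not in drop_cols]


# Stand-in selector for illustration; a real one would be called with the DataFrame.
def text_selector(columns):
    return [c for c in columns if c in {"name", "sex"}]


columns = ["pclass", "name", "sex", "age"]
features = [(text_selector, None), ("age", None)]

selected = resolve_selected_columns(features, columns)
print(unselected_columns(columns, selected))  # ['pclass']
```

Since `_selected_columns` is a property without access to `X`, doing the resolution inside `_unselected_columns`, which already receives `X`, may be the less invasive of the two options.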





Bug when using sklearn's make_column_selector & default=None #259
