Description
Hi,
when DataFrameMapper is called with
- default=None, and
- the feature selection specified using sklearn's make_column_selector,
the output of transform() is incorrect: it passes through ALL columns, instead of only the columns not selected by any transformer, as intended.
To debug this, I monkeypatched sklearn-pandas to add two print statements to _unselected_columns:
def _unselected_columns(self, X):
    """
    Return list of columns present in X and not selected explicitly in the
    mapper.
    Unselected columns are returned in the order they appear in the
    dataframe to avoid issues with different ordering during default fit
    and transform steps.
    """
    X_columns = list(X.columns)
    unselected = [column for column in X_columns if
                  column not in self._selected_columns
                  and column not in self.drop_cols]
    print(f"Selected: {list(self._selected_columns)}")
    print(f"Unselected: {unselected}")
    return unselected
If the feature columns are passed as a list of strings (by calling the column selector on the dataframe first), everything works as it should:
from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]
categorical_features = make_column_selector(dtype_include=dtype_selection)
# Resolve the selector into a plain list of column names before building the mapper
categorical_features = categorical_features(X)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")
mapper = DataFrameMapper(
    [(
        categorical_features,
        DebugTransformer()  # user-defined no-op: fit() returns self, transform() returns X unchanged
    )],
    df_out=True,
    input_df=True,
    default=None,
)
mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns:", list(out.columns))
>>>Selector: categorical_features=['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest'], type(categorical_features)=<class 'list'>
>>>Selected: ['embarked', 'boat', 'home.dest', 'sex', 'ticket', 'name', 'cabin']
>>>Unselected: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'body']
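The DebugTransformer used in these snippets is not shown in the report. Going only by the comment next to it (fit() returns self, transform() returns the input), a minimal sketch could look like the following. It is deliberately dependency-free and duck-typed, which is an assumption about what the mapper needs; in real code one would more conventionally subclass sklearn.base.BaseEstimator and sklearn.base.TransformerMixin.

```python
class DebugTransformer:
    """No-op transformer (sketch): learns nothing and changes nothing."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, X):
        # Hand the input back untouched
        return X
```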
However, if the column selector itself is passed in the feature tuple, the output is incorrect: the selected columns are duplicated in the final output:
from sklearn.compose import make_column_selector
from sklearn.datasets import fetch_openml
from sklearn_pandas import DataFrameMapper

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
dtype_selection = ["category", "object"]
# This time the selector object is passed to the mapper unresolved
categorical_features = make_column_selector(dtype_include=dtype_selection)
print(f"Selector: {categorical_features=}, {type(categorical_features)=}")
mapper = DataFrameMapper(
    [(
        categorical_features,
        DebugTransformer()  # user-defined no-op: fit() returns self, transform() returns X unchanged
    )],
    df_out=True,
    input_df=True,
    default=None,
)
mapper.fit(X, y)
out = mapper.transform(X)
print("Output columns:", list(out.columns))
>>>Selector: categorical_features=<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6CB47F0>, type(categorical_features)=<class 'sklearn.compose._column_transformer.make_column_selector'>
>>>Selected: [<sklearn.compose._column_transformer.make_column_selector object at 0x0000021FC6B0F7C0>]
>>>Unselected: ['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
>>>Output columns: ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat', 'home.dest', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']
This happens because the selected columns are not filtered out properly in the _unselected_columns method, in this part:
unselected = [column for column in X_columns if
              column not in self._selected_columns  # <-- the set holds the selector object itself, not column names, so this is always True
              and column not in self.drop_cols]
Here, the selector callable itself is added to the set, as if it were a column name, by the line selected_columns.add(columns) in the _selected_columns property.
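The effect can be reproduced without sklearn-pandas. The following is a minimal sketch in plain Python; ColumnSelector is a hypothetical stand-in for a make_column_selector instance, and the list of names stands in for the dataframe columns.

```python
class ColumnSelector:
    """Hypothetical stand-in for a callable column selector."""

    def __init__(self, names):
        self.names = set(names)

    def __call__(self, columns):
        # Return the matching column names, in input order
        return [c for c in columns if c in self.names]


X_columns = ["pclass", "name", "sex", "age"]
selector = ColumnSelector(["name", "sex"])

# Mirrors the reported behaviour: the selector object itself is stored,
# not the column names it would resolve to
selected_columns = set()
selected_columns.add(selector)

# No string ever equals the selector object, so nothing is filtered out
unselected = [c for c in X_columns if c not in selected_columns]
print(unselected)  # ['pclass', 'name', 'sex', 'age']
```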
To solve this, handling for sklearn's make_column_selector should be added either in _unselected_columns or in the _selected_columns property.
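One possible shape for such a fix is sketched below. This is not sklearn-pandas code: resolve_selected_columns and unselected_columns are hypothetical helper names, and FakeFrame and select_object_columns are stand-ins for a dataframe and a make_column_selector instance so the example runs without dependencies. The idea is simply to call any callable column spec on the input before building the set of selected names.

```python
def resolve_selected_columns(column_specs, X):
    """Expand every column spec into concrete column names (hypothetical helper)."""
    selected = set()
    for spec in column_specs:
        if callable(spec):
            # A selector such as make_column_selector(...) returns names when called on X
            selected.update(spec(X))
        elif isinstance(spec, (list, tuple)):
            selected.update(spec)
        else:
            selected.add(spec)
    return selected


def unselected_columns(X, column_specs, drop_cols=()):
    """Columns of X that no spec selects, in their original order."""
    selected = resolve_selected_columns(column_specs, X)
    return [c for c in X.columns if c not in selected and c not in drop_cols]


class FakeFrame:
    """Stand-in for a DataFrame: just column names and their dtypes."""

    def __init__(self, dtypes):
        self.dtypes = dict(dtypes)
        self.columns = list(self.dtypes)


def select_object_columns(X):
    # Stand-in for make_column_selector(dtype_include="object")
    return [name for name, dtype in X.dtypes.items() if dtype == "object"]


frame = FakeFrame({"pclass": "int64", "name": "object", "sex": "object", "age": "float64"})
print(unselected_columns(frame, [select_object_columns]))  # ['pclass', 'age']
```

Note that resolving a selector needs the input dataframe, which a property such as _selected_columns does not receive, so a real fix would probably have to resolve the names inside _unselected_columns or cache them during fit.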
A temporary workaround for anyone else who runs into this: call the column selector on your input dataframe yourself, then pass the resulting column names to the features parameter as a list.