Hi,

When calling DataFrameMapper with the feature selection specified using sklearn's make_column_selector, the output of transform() is incorrect: it passes through ALL columns, not only the ones unaffected by the transformation, as is intended.
To debug this, I monkeypatched sklearn-pandas to insert some prints into `_unselected_columns`:

    def _unselected_columns(self, X):
        """
        Return list of columns present in X and not selected explicitly in the
        mapper.

        Unselected columns are returned in the order they appear in the
        dataframe to avoid issues with different ordering during default fit
        and transform steps.
        """
        X_columns = list(X.columns)
        unselected = [
            column for column in X_columns
            if column not in self._selected_columns
            and column not in self.drop_cols
        ]
        print(f"Selected: {list(self._selected_columns)}")
        print(f"Unselected: {unselected}")
        return unselected
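First, if the columns for the feature are passed as a list of strings (by resolving the column_selector directly), everything works as it should. However, if the column_selector itself is passed in the features tuple, the output is incorrect: the selected columns are duplicated in the final output.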
This happens because the selected columns are not filtered out properly in the `_unselected_columns` method, in this part:

    unselected = [
        column for column in X_columns
        if column not in self._selected_columns  # <-- this contains the selector function, so 'not in' won't work
        and column not in self.drop_cols
    ]

The selector function itself is added as a "column" in the `_selected_columns` property, in the line `selected_columns.add(columns)`.
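The membership failure can be reproduced without sklearn-pandas at all. The snippet below is only an illustration: `select_numeric` is a hypothetical stand-in for a column selector, and the set mimics what `_selected_columns` appears to hold once the callable has been added to it.

```python
# Hypothetical stand-in for a column selector: any callable that maps a
# dataframe-like object to a list of column names.
def select_numeric(X):
    return ["age", "income"]


# What the mapper effectively ends up with: the callable itself is stored
# in the set, not the column names it would return.
selected_columns = {select_numeric}
X_columns = ["age", "income", "city"]

# No column name is equal to the function object, so nothing is filtered out.
unselected = [c for c in X_columns if c not in selected_columns]
print(unselected)  # ['age', 'income', 'city']
```

Every column is reported as unselected, which matches the duplicated columns seen in the output.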
To solve this, add handling for sklearn's column_selector either within `_unselected_columns` or in the `_selected_columns` property.
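One possible shape for such handling is sketched below. This is not the sklearn-pandas implementation: `resolve_selected` and `unselected_columns` are hypothetical helper names, and a `SimpleNamespace` with a `.columns` attribute stands in for a dataframe so that the sketch runs on its own. Where exactly this logic would live inside the library is left open.

```python
from types import SimpleNamespace


def resolve_selected(selected, X):
    """Expand callables in `selected` into the column names they return for X."""
    resolved = set()
    for entry in selected:
        if callable(entry):
            # A selector such as make_column_selector(...) returns column names.
            resolved.update(entry(X))
        else:
            resolved.add(entry)
    return resolved


def unselected_columns(X, selected, drop_cols=()):
    """Columns of X that are neither selected nor dropped, in X's order."""
    resolved = resolve_selected(selected, X)
    return [c for c in X.columns if c not in resolved and c not in drop_cols]


# Stand-in "dataframe" and selector, both assumptions for illustration only.
X = SimpleNamespace(columns=["age", "income", "city"])
selector = lambda frame: [c for c in frame.columns if c != "city"]

print(unselected_columns(X, {selector}))  # ['city']
```

Resolving the callable against `X` before the membership test means only the genuinely untouched columns are passed through.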
A temporary workaround for others who encounter this: call the dtype column_selector directly on your input dataframe, then pass only the resulting column names to the `features` parameter as lists.
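A minimal sketch of that workaround, assuming pandas and scikit-learn are installed. The dataframe, the column names, and the choice of `StandardScaler` are made up for illustration, and the `DataFrameMapper` arguments shown in the comment (`default=None`, `df_out=True`) are an assumption about the setup rather than something taken from the original report.

```python
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler

# Illustrative data; the column names are assumptions.
df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "income": [40000.0, 52000.0, 81000.0],
        "city": ["Oslo", "Lima", "Pune"],
    }
)

# Resolve the selector up front: calling it on the dataframe returns a
# plain list of column names instead of leaving a callable in the features.
numeric_cols = make_column_selector(dtype_include="number")(df)

# Pass the resolved list, not the selector object, in the features tuple.
features = [(numeric_cols, StandardScaler())]

# With sklearn-pandas installed, the list would then be used like this:
#   from sklearn_pandas import DataFrameMapper
#   mapper = DataFrameMapper(features, default=None, df_out=True)
#   out = mapper.fit_transform(df)
print(numeric_cols)
```

Because `features` now contains only strings, the `not in` check in `_unselected_columns` compares column names with column names.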