LabelEncoder + Imputer + LabelBinarizer error #96

Open
paubelsan opened this issue May 15, 2017 · 5 comments

Comments

@paubelsan
Hi,

I'm getting an error when using LabelEncoder + Imputer + LabelBinarizer in a mapper. The output of LabelEncoder is a vector of shape (n_samples,). Imputer calls the sklearn function check_array, which calls the numpy function atleast_2d and transforms the vector to shape (1, n_samples), so LabelBinarizer crashes:

ValueError: Multioutput target data is not supported with label binarization

How can I fix this issue?

Many thanks!
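
The shape change described above can be reproduced with numpy alone. This is a minimal sketch with made-up integer codes standing in for LabelEncoder output; it only shows what atleast_2d does to a 1-d vector and what the column shape (n_samples, 1) looks like in comparison.

```python
import numpy as np

# Illustrative integer codes, shaped like LabelEncoder output: (n_samples,)
encoded = np.array([0, 2, 1, 0])

# atleast_2d prepends an axis, so the samples end up along the columns
as_row = np.atleast_2d(encoded)

# A column vector keeps one sample per row, which is what 2-d transformers expect
as_column = encoded.reshape(-1, 1)

print(encoded.shape, as_row.shape, as_column.shape)
```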

@dukebody
Collaborator

My recommendation here is to create a subclass of LabelEncoder that reshapes its output to a 2-d array of shape (n_samples, 1) when needed, so that all transformers in the chain work with the same kind of array and stay compatible.

If you come up with that implementation, please post it here in a PR, as it would be suitable for inclusion in sklearn-pandas.
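
One possible shape for such a subclass is sketched below. The class name ColumnLabelEncoder is hypothetical and is not part of sklearn or sklearn-pandas; the sketch assumes a single column of labels as input and simply flattens on the way in and reshapes to (n_samples, 1) on the way out.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class ColumnLabelEncoder(LabelEncoder):
    """Hypothetical LabelEncoder variant that returns shape (n_samples, 1)."""

    def fit(self, y):
        # Accept either (n_samples,) or (n_samples, 1) input
        return super().fit(np.ravel(y))

    def transform(self, y):
        return super().transform(np.ravel(y)).reshape(-1, 1)

    def fit_transform(self, y):
        return super().fit_transform(np.ravel(y)).reshape(-1, 1)

    def inverse_transform(self, y):
        # Flatten the column back to 1-d before decoding
        return super().inverse_transform(np.ravel(y))


encoder = ColumnLabelEncoder()
codes = encoder.fit_transform(["b", "a", "c", "a"])
decoded = encoder.inverse_transform(codes)
print(codes.shape)
```

With a column-shaped output, a following 2-d transformer receives one sample per row instead of a single row of n_samples values.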

@devforfu
Collaborator

devforfu commented Nov 12, 2017

@paubelsan @dukebody If the proposal for this enhancement is still relevant and nobody is working on it right now, I could give it a try. I am not sure how to reproduce the issue, though: I get a different exception when applying the sequence [LabelEncoder(), Imputer(), LabelBinarizer()], namely:

ValueError: col: Expected 2D array, got 1D array instead

It is raised not at the LabelBinarizer step, but while imputing values.
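
A minimal sketch of how this second error can arise is given below. It assumes a recent sklearn, where Imputer has been replaced by sklearn.impute.SimpleImputer, so SimpleImputer stands in for the Imputer used in the thread; the data values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# 1-d vector with a missing value, shaped like LabelEncoder output
encoded = np.array([0.0, 1.0, np.nan, 1.0])

try:
    SimpleImputer().fit_transform(encoded)
    error_message = None
except ValueError as exc:
    # Recent sklearn versions reject 1-d input instead of reshaping it
    error_message = str(exc)

# Reshaping to a single column lets the imputer run
imputed = SimpleImputer().fit_transform(encoded.reshape(-1, 1))
print(error_message)
print(imputed.shape)
```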

@dukebody
Collaborator

@devforfu I guess that @paubelsan might have different versions of numpy/pandas, but the issue looks the same to me: LabelEncoder() returns a 1-d vector, while other transformers expect 2-d vectors.

I kind of remember some conversations about creating a transformer ([CategoricalEncoder](http://contrib.scikit-learn.org/categorical-encoding/)?) in sklearn to do what LabelEncoder() does but generating 2-d vectors, for arbitrary 2-d data, including strings. I'd check this linked project and, if it doesn't fit what we want, then we can implement our own version.

@FlorisHoogenboom
Contributor

Since many people are likely still running into this problem, it may be good to note here that in the dev-0.20 version of sklearn, OneHotEncoder directly supports categorical inputs without using LabelEncoder. I think this mostly resolves the encoding issues when using sklearn-pandas.
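
As a rough illustration of that approach, the sketch below passes a 2-d column of strings straight to OneHotEncoder, assuming sklearn 0.20 or later. The column values are invented for the example, and the result is converted with toarray() because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with string values, shape (n_samples, 1)
colors = np.array([["red"], ["green"], ["red"], ["blue"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()

# Categories are learned per column and stored in sorted order
categories = encoder.categories_[0].tolist()
print(categories)
print(onehot)
```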

@nabaskes

I'm not sure if anyone is still experiencing this problem in light of recent updates to sklearn, but if you have a list of categorical variable keys, you can do something like:

DataFrameMapper([(c, LabelBinarizer()) for c in categorical] + [(n, None) for n in df.columns if n not in categorical])

Hopefully this is helpful to somebody.


5 participants