Behaviour of OneHotEncoder in a Pipeline #83
Comments
@matejklemen good catch! The problem is that … Also, should we change the API of predictors in that case 🤔?
I like the idea, because it seems quite elegant, and … Some things that we need to be careful about (or that we need to think about):
Another way I can think of solving the above problem is that each …
I completely agree, we really should be careful not to complicate things too much. Do you think a simple case class with 0 functionality would work? It would really serve just as a wrapper that holds the three values.
I think so, yes.
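A zero-functionality wrapper like the one discussed above could look roughly as follows. This is only a sketch in plain Scala, not the project's actual API; the field names and the assumption that the "three values" are feature names, feature types, and column positions are hypothetical:

```scala
// Hypothetical sketch of a case class with no behaviour of its own,
// serving purely as a container for per-feature metadata.
sealed trait FeatureType
case object NumericalFeature extends FeatureType
case object CategoricalFeature extends FeatureType

final case class FeatureIndex(
  names: List[String],        // assumed field: feature names
  types: List[FeatureType],   // assumed field: feature types
  columnIndices: List[Int]    // assumed field: column positions in the data
)

val index = FeatureIndex(
  names = List("f0", "f1"),
  types = List(CategoricalFeature, NumericalFeature),
  columnIndices = List(0, 1)
)
println(index.names.mkString(", "))
```

Because it is just a case class, transformers could copy it with `index.copy(...)` when they add or remove columns, without the wrapper itself needing any logic.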
I was thinking we could simply make …
Hmm, I like this approach for its simplicity. It could also be a middle step towards the solution I proposed; each transformer would need to implement …
Sorry, I've read this and then forgot to reply. 😅 I've taken a look at this and defined two new functions:

```scala
def featureIndex(model: A): FeatureIndex = {
  require(isFitted(model), "Requested modified feature index on a model that is not fitted yet")
  featureIndexSafe(model)
}

protected def featureIndexSafe(model: A): FeatureIndex
```

I then started to add the implementations for concrete transformers. Almost all of them really just pass on the source …

This brought me to notice another potential inconvenience in … Features are processed left-to-right and the new binary features are appended at the end of the data. So, assuming a user knows this, they then need to know how many unique values the source feature held and do some "hacking" to select just the new features.

I can add a sentence to the docs about how the new binary features are constructed and where they are put. And if a … Let me know what you think - this is mostly about convenience, really.
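To make the inconvenience concrete, here is a standalone sketch (plain Scala, not doddle-model code; all names are hypothetical) of how one-hot encoding rearranges a column index: the source categorical column is dropped and one binary column per unique value is appended at the end, so selecting only the new features requires knowing how many unique values the source feature had:

```scala
// Standalone sketch: rewrite a list of column names when the column at
// position `source` is one-hot encoded into `numUnique` binary columns.
def oneHotIndex(names: Vector[String], source: Int, numUnique: Int): Vector[String] = {
  val kept = names.patch(source, Nil, 1)                              // drop the source column
  val encoded = (0 until numUnique).map(i => s"${names(source)}_$i")  // new columns go at the end
  kept ++ encoded
}

val before = Vector("colour", "height")
val after = oneHotIndex(before, source = 0, numUnique = 6)
// The numerical feature is now first; the 6 binary columns sit at the end,
// so picking them out means knowing numUnique and slicing the tail.
println(after.mkString(", "))
```

A naming convention like the `colour_0`, `colour_1`, … suffixes used here would at least let users select the new features by name rather than by position.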
Yeah, me too 😅. I think a statement in the docs + the naming convention that you proposed …
@matejklemen can I assign this to you (to better track/label existing issues)?
Yeah, sure :)
Describe the bug
Consider the code below, which creates a pipeline where a one-hot encoder is first applied to the categorical feature and then the numerical feature is normalized to have 0 mean and variance of 1 (+ softmax, which is not important here).

To observe that everything was working correctly, I put a `println` inside the `fit` method of `Pipeline`. I noticed that the above code does not work as intended (see **Expected behaviour** and **Actual behaviour**).

To Reproduce
Expected behaviour
The first feature is removed and encoded by 6 new columns. Then, the standard scaler is applied to the (now) FIRST (index 0) feature.
Actual behaviour
The first feature is removed and encoded by 6 new columns. Then, the standard scaler is applied to the SECOND (index 1) feature.
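The index shift can be illustrated without the library at all. A minimal sketch (hypothetical column names, not the code from the report):

```scala
// Standalone illustration of the reported index shift. After the categorical
// column (index 0, 6 unique values) is removed and its 6 binary columns are
// appended, the numerical feature moves from index 1 to index 0.
val originalColumns = Vector("category", "numerical")
val encoded = originalColumns.patch(0, Nil, 1) ++ (0 until 6).map(i => s"category_$i")

val intendedIndex = encoded.indexOf("numerical") // where the scaler should look
println(s"numerical feature is now at index $intendedIndex")
// A scaler still configured with the pre-encoding index 1 would instead
// touch encoded(1), i.e. the first binary column - the behaviour observed above.
println(s"index 1 now holds: ${encoded(1)}")
```

In other words, the scaler in the pipeline appears to use the feature's position from before the one-hot step rather than after it.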
Versions
Scala: 2.13.0
doddle-model: 0.0.1
Additional context
Prints of partial results (i.e. transformed data):