
Behaviour of OneHotEncoder in a Pipeline #83

Open
matejklemen opened this issue Sep 11, 2019 · 7 comments
@matejklemen
Member

Describe the bug
Consider the code below, which creates a pipeline where a one-hot encoder is first applied to the categorical feature and the numerical feature is then standardized to zero mean and unit variance (+ a softmax classifier, which is not important here).

To verify that everything was working correctly, I put a println inside the fit method of Pipeline and noticed that the code above does not work as intended (see Expected behaviour and Actual behaviour).

To Reproduce

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._

// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
	List(0.0, 3.7),
	List(5.0, 2.4),
	List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
	pipe(OneHotEncoder(featureIndex)),
	pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)

Expected behaviour
The first feature is removed and encoded into 6 new columns. Then, the standard scaler is applied to the (now) FIRST (index 0) feature, i.e. the numerical one.

Actual behaviour
The first feature is removed and encoded into 6 new columns. Then, the standard scaler is applied to the SECOND (index 1) feature, i.e. the first of the new one-hot columns.

Versions
Scala: 2.13.0
doddle-model: 0.0.1

Additional context
Prints of partial results (i.e. transformed data):

  • after first transformer is applied:
3.7   1.0  0.0  0.0  0.0  0.0  0.0  
2.4   0.0  0.0  0.0  0.0  0.0  1.0  
-0.3  0.0  0.0  1.0  0.0  0.0  0.0 
  • after second transformer is applied:
3.7   1.1547005383792515   0.0  0.0  0.0  0.0  0.0  
2.4   -0.5773502691896258  0.0  0.0  0.0  0.0  1.0  
-0.3  -0.5773502691896258  0.0  1.0  0.0  0.0  0.0
@inejc added the bug label on Sep 12, 2019
@inejc
Member

inejc commented Sep 12, 2019

@matejklemen good catch! The problem is that FeatureIndex is not updated by any of the preprocessors. I was thinking we could create a class called Dataset which would essentially encapsulate x, y and featureIndex. That way, we could change the API of transformers to work on Dataset rather than just on x (and implementations would always keep stuff in Dataset in sync). There could be a better way of doing this but it doesn't come to mind ATM, thoughts?
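
Roughly something like this (just a sketch to illustrate the idea; none of these names exist in the current API):

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.FeatureIndex

// a plain wrapper that keeps the data and its feature index together
case class Dataset(x: DenseMatrix[Double], y: DenseVector[Double], featureIndex: FeatureIndex)

// transformers would fit on and return a Dataset, so implementations are forced to keep
// x and featureIndex in sync (e.g. OneHotEncoder would also update the index it returns)
trait DatasetTransformer[A] {
  def fit(model: A, dataset: Dataset): A
  def transform(model: A, dataset: Dataset): Dataset
}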

Also, should we also change the API of predictors in that case 🤔?

@matejklemen
Member Author

I like the idea because it seems quite elegant:
(1.) it solves the above problem,
(2.) it solves a problem with preprocessors that we would eventually have to deal with anyway, namely that we sometimes want to preprocess just the labels, not the features.

Some things that we need to be careful about (or that we need to think about):

  • an easy to use API, so that the user doesn't need to wrap their data into too many additional layers
  • should a feature index for the label be added as well?
  • will this bring much overhead?

Another way I can think of solving the above problem is that each Transformer could also implement a featureIndex() method, which would return the transformed feature index. The transformed feature index would be set when the transformers' fit() methods are called. Inside the pipeline, an additional call to featureIndex() would need to be made to keep track of the current data domain (rough sketch below).
It does not seem as elegant as your proposed solution and does not solve (2.), but it requires fewer changes. In the long run, I think your solution is better.
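
A rough, self-contained sketch of what I mean (this is not the actual Pipeline code and the names are made up):

import breeze.linalg.DenseMatrix
import io.picnicml.doddlemodel.data.Feature.FeatureIndex

trait FittedTransformer {
  def transform(x: DenseMatrix[Double]): DenseMatrix[Double]
  def featureIndex: FeatureIndex // the *transformed* feature index, set during fit
}

// the pipeline threads both the transformed data and the current feature index through
// the transformers, so it always knows the current data domain
def fitAll(fitFns: List[DenseMatrix[Double] => FittedTransformer],
           x: DenseMatrix[Double],
           index: FeatureIndex): (List[FittedTransformer], DenseMatrix[Double], FeatureIndex) =
  fitFns.foldLeft((List.empty[FittedTransformer], x, index)) {
    case ((fitted, xCurrent, _), fitFn) =>
      val fittedT = fitFn(xCurrent)
      // keep the transformed data and ask the transformer for the updated feature index
      (fitted :+ fittedT, fittedT.transform(xCurrent), fittedT.featureIndex)
  }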

@inejc
Member

inejc commented Sep 18, 2019

an easy to use API, so that the user doesn't need to wrap their data into too many additional layers

I completely agree; we really should be careful not to complicate things too much. Do you think a simple case class with zero functionality would work? It would really serve just as a wrapper that holds the three values.

should a feature index for the label be added as well?

I think so, yes.

will this bring much overhead?

I was thinking we could simply make x, y and featureIndex on Dataset public; estimators would then just use dataset.x, etc. That way I don't think performance would be an issue.

Another way I can think of solving the above problem is that each Transformer could also implement a featureIndex(), which would return the transformed feature index. The transformed feature index would be set on calling fit() methods of transformers. Inside the pipeline, an additional call to featureIndex() would need to be made to keep track of current data domain.

Hmm, I like this approach for its simplicity. It could also be a middle step towards the solution I proposed; each transformer would need to implement def featureIndex(), which would be required in order to switch to Dataset as well. I think we could go down this road and add a more complex solution later if it proves useful?

@matejklemen
Member Author

Sorry, I've read this and then forgot to reply. 😅

I've taken a look at this and defined two new functions for the Transformer trait:

def featureIndex(model: A): FeatureIndex = {
    require(isFitted(model), "Requested modified feature index on a model that is not fitted yet")
    featureIndexSafe(model)
}

protected def featureIndexSafe(model: A): FeatureIndex

and started adding the implementations for the concrete transformers. Almost all of them really just pass on the source FeatureIndex; only OneHotEncoder and Binarizer don't.

This brought me to notice another potential inconvenience in OneHotEncoder: it might not be clear which new features belong to which source feature.

Features are processed left-to-right and the new binary features are appended at the end of the data. So even assuming a user knows this, they then need to know how many unique values the source feature held and do some "hacking" to select just the new features.

I can add a sentence to the docs describing how the new binary features are constructed and where they are put. And if a featureIndex(...) function is to be added, maybe the new features could be given names derived from the original features? E.g. if the source feature is group, the new features would be named group_0, group_1, ... (indices from 0 up to the number of unique values, or maybe the unique values of the feature themselves should be used).
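
Purely as an illustration of the naming convention (the helper below doesn't exist anywhere, and where the number of distinct values comes from is left open):

// derive the names of the new binary columns from the source feature's name and the number
// of distinct values seen during fit, e.g. "group" with 6 distinct values -> group_0 ... group_5
def oneHotColumnNames(sourceFeatureName: String, numDistinctValues: Int): IndexedSeq[String] =
  (0 until numDistinctValues).map(i => s"${sourceFeatureName}_$i")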

Let me know what you think - this is mostly about convenience really.

@inejc
Member

inejc commented Sep 25, 2019

Yeah, me too 😅. I think a statement in the docs + the naming convention you proposed (group_0, group_1, ...) should solve the immediate problem ✅.

@inejc
Member

inejc commented Sep 26, 2019

@matejklemen can I assign this to you (to better track/label existing issues)?

@matejklemen
Member Author

Yea, sure :)
