
Behaviour of OneHotEncoder in a Pipeline #83

Open
matejklemen opened this issue Sep 11, 2019 · 7 comments
@matejklemen
Member

Describe the bug
Consider the code below, which creates a pipeline where a one-hot encoder is first applied to the categorical feature and the numerical feature is then standardized to zero mean and unit variance (+ a softmax classifier, which is not important here).

To verify that everything was working correctly, I put a println inside the fit method of Pipeline and noticed that the code above does not work as intended (see Expected behaviour and Actual behaviour).

To Reproduce

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.{FeatureIndex, CategoricalFeature, NumericalFeature}
import io.picnicml.doddlemodel.pipeline.Pipeline.pipe
import io.picnicml.doddlemodel.pipeline.{Pipeline, PipelineTransformers}
import io.picnicml.doddlemodel.preprocessing.{StandardScaler, OneHotEncoder}
import io.picnicml.doddlemodel.linear.SoftmaxClassifier
import io.picnicml.doddlemodel.syntax.PredictorSyntax._

// 3 examples with 1 categorical and 1 numerical feature
val xTr = DenseMatrix(
	List(0.0, 3.7),
	List(5.0, 2.4),
	List(2.0, -0.3)
)
val yTr = DenseVector(0.0, 1.0, 0.0)
val featureIndex = FeatureIndex(List(CategoricalFeature, NumericalFeature))
val transformers: PipelineTransformers = List(
	pipe(OneHotEncoder(featureIndex)),
	pipe(StandardScaler(featureIndex))
)
val pipeline = Pipeline(transformers)(pipe(SoftmaxClassifier()))
val trainedPipeline = pipeline.fit(xTr, yTr)

Expected behaviour
The first feature is removed and encoded into 6 new columns. Then, the standard scaler is applied to the (now) FIRST (index 0) feature, i.e. the numerical one.

Actual behaviour
The first feature is removed and encoded into 6 new columns. Then, the standard scaler is applied to the SECOND (index 1) feature, i.e. the first of the new one-hot columns.

Versions
Scala: 2.13.0
doddle-model: 0.0.1

Additional context
Prints of partial results (i.e. transformed data):

  • after first transformer is applied:
3.7   1.0  0.0  0.0  0.0  0.0  0.0  
2.4   0.0  0.0  0.0  0.0  0.0  1.0  
-0.3  0.0  0.0  1.0  0.0  0.0  0.0 
  • after second transformer is applied:
3.7   1.1547005383792515   0.0  0.0  0.0  0.0  0.0  
2.4   -0.5773502691896258  0.0  0.0  0.0  0.0  1.0  
-0.3  -0.5773502691896258  0.0  1.0  0.0  0.0  0.0
@inejc added the bug label on Sep 12, 2019
@inejc
Member

inejc commented Sep 12, 2019

@matejklemen good catch! The problem is that FeatureIndex is not updated by any of the preprocessors. I was thinking we could create a class called Dataset which would essentially encapsulate x, y and featureIndex. That way, we could change the API of transformers to work on Dataset rather than just on x (and implementations would always keep stuff in Dataset in sync). There could be a better way of doing this but it doesn't come to mind ATM, thoughts?
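
Roughly something like this (just a sketch to illustrate the idea; none of these names exist in the current API):

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.data.Feature.FeatureIndex

// a plain wrapper that keeps the data and its feature index together
case class Dataset(x: DenseMatrix[Double], y: DenseVector[Double], featureIndex: FeatureIndex)

// transformers would fit on and return a Dataset, so implementations are forced to keep
// x and featureIndex in sync (e.g. OneHotEncoder would also update the index it returns)
trait DatasetTransformer[A] {
  def fit(model: A, dataset: Dataset): A
  def transform(model: A, dataset: Dataset): Dataset
}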

Also, should we also change the API of predictors in that case 🤔?

@matejklemen
Member Author

I like the idea because it seems quite elegant:
(1.) it solves the above problem,
(2.) it solves a problem with preprocessors that we would eventually have to deal with anyway, namely that we sometimes want to preprocess just the labels, not the features.

Some things that we need to be careful about (or that we need to think about):

  • an easy to use API, so that the user doesn't need to wrap their data into too many additional layers
  • should a feature index for the label be added as well?
  • will this bring much overhead?

Another way I can think of solving the above problem is that each Transformer could also implement a featureIndex() method, which would return the transformed feature index. The transformed feature index would be set when the transformers' fit() methods are called. Inside the pipeline, an additional call to featureIndex() would need to be made to keep track of the current data domain (rough sketch below).
It does not seem as elegant as your proposed solution and does not solve (2.), but it requires fewer changes. In the long run, I think your solution is better.
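
A rough, self-contained sketch of what I mean (this is not the actual Pipeline code and the names are made up):

import breeze.linalg.DenseMatrix
import io.picnicml.doddlemodel.data.Feature.FeatureIndex

trait FittedTransformer {
  def transform(x: DenseMatrix[Double]): DenseMatrix[Double]
  def featureIndex: FeatureIndex // the *transformed* feature index, set during fit
}

// the pipeline threads both the transformed data and the current feature index through
// the transformers, so it always knows the current data domain
def fitAll(fitFns: List[DenseMatrix[Double] => FittedTransformer],
           x: DenseMatrix[Double],
           index: FeatureIndex): (List[FittedTransformer], DenseMatrix[Double], FeatureIndex) =
  fitFns.foldLeft((List.empty[FittedTransformer], x, index)) {
    case ((fitted, xCurrent, _), fitFn) =>
      val fittedT = fitFn(xCurrent)
      // keep the transformed data and ask the transformer for the updated feature index
      (fitted :+ fittedT, fittedT.transform(xCurrent), fittedT.featureIndex)
  }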

@inejc
Member

inejc commented Sep 18, 2019

an easy to use API, so that the user doesn't need to wrap their data into too many additional layers

I completely agree; we really should be careful not to complicate things too much. Do you think a simple case class with zero functionality would work? It would really serve just as a wrapper that holds the three values.

should a feature index for the label be added as well?

I think so, yes.

will this bring much overhead?

I was thinking we could simply make x, y and featureIndex on Dataset public; estimators would then just use dataset.x, etc. That way I don't think performance would be an issue.

Another way I can think of solving the above problem is that each Transformer could also implement a featureIndex(), which would return the transformed feature index. The transformed feature index would be set on calling fit() methods of transformers. Inside the pipeline, an additional call to featureIndex() would need to be made to keep track of current data domain.

Hmm, I like this approach for its simplicity. It could also be a middle step towards the solution I proposed; each transformer would need to implement def featureIndex(), which would be required in order to switch to Dataset as well. I think we could go down this road and add a more complex solution later if it proves useful?

@matejklemen
Member Author

Sorry, I've read this and then forgot to reply. 😅

I've taken a look at this and defined two new functions for the Transformer trait:

def featureIndex(model: A): FeatureIndex = {
    require(isFitted(model), "Requested modified feature index on a model that is not fitted yet")
    featureIndexSafe(model)
}

protected def featureIndexSafe(model: A): FeatureIndex

and started adding the implementations for the concrete transformers. Almost all of them really just pass on the source FeatureIndex; only OneHotEncoder and Binarizer don't.

This brought me to notice another potential inconvenience in OneHotEncoder: it might not be clear which new features belong to which source feature.

Features are processed left-to-right and the new binary features are appended at the end of the data. So even assuming a user knows this, they then need to know how many unique values the source feature held and do some "hacking" to select just the new features.

I can add a sentence to the docs describing how the new binary features are constructed and where they are put. And if a featureIndex(...) function is to be added, maybe the new features could be given names derived from the original features? E.g. if the source feature is group, the new features would be named group_0, group_1, ... (indices from 0 up to the number of unique values, or maybe the unique values of the feature themselves should be used).
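
Purely as an illustration of the naming convention (the helper below doesn't exist anywhere, and where the number of distinct values comes from is left open):

// derive the names of the new binary columns from the source feature's name and the number
// of distinct values seen during fit, e.g. "group" with 6 distinct values -> group_0 ... group_5
def oneHotColumnNames(sourceFeatureName: String, numDistinctValues: Int): IndexedSeq[String] =
  (0 until numDistinctValues).map(i => s"${sourceFeatureName}_$i")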

Let me know what you think - this is mostly about convenience really.

@inejc
Member

inejc commented Sep 25, 2019

Yeah, me too 😅. I think a statement in the docs + the naming convention you proposed (group_0, group_1, ...) should solve the immediate problem ✅.

@inejc
Member

inejc commented Sep 26, 2019

@matejklemen can I assign this to you (to better track/label existing issues)?

@matejklemen
Member Author

Yea, sure :)
