contamination / information leakage between training and testing data #2

@makesourcenotcode

Description

I noticed that README.md and train.php both make the same mistake:

Namely, ZScaleStandardizer is applied BEFORE the train/test split rather than AFTER. This leaks information from the test set into training right from the start and calls into question all the metrics reported at the end.

The correct approach would be to fit ZScaleStandardizer on the training set only, capture its parameters, and apply those same parameters to the testing set before using the trained model to make predictions.
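Since the project itself is PHP, here is just a language-agnostic sketch (in Python, with hypothetical helper names `zscale_fit` / `zscale_apply`) of the pattern being described: split first, compute the mean and standard deviation from the training rows only, then reuse those captured parameters on the test rows.

```python
import random

def zscale_fit(rows):
    # Compute per-feature mean and std from the TRAINING data only.
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [
        (sum((r[d] - means[d]) ** 2 for r in rows) / n) ** 0.5 or 1.0
        for d in range(dims)
    ]
    return means, stds

def zscale_apply(rows, means, stds):
    # Reuse the captured training parameters on any other split.
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

# Split FIRST, then fit the standardizer on the training portion only.
data = [[float(i), float(2 * i)] for i in range(10)]
random.seed(0)
random.shuffle(data)
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]

means, stds = zscale_fit(train)
train_std = zscale_apply(train, means, stds)
test_std = zscale_apply(test, means, stds)  # no test statistics leak in
```

The point is that `zscale_fit` never sees the test rows, so no statistic derived from the test set can influence training or the final evaluation.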

This can mislead newer users studying machine learning into bad habits that will later need to be unlearned.
