Model and strategy optimization
I'll be listing the major steps and advances in my parameter optimization. The RMSE scores displayed here are not representative of the final evaluation; refer to the Final Evaluation page for proper scores. The number of time-series splits used is always 6.
MIMO strategies had pred_len set to 3, since we want to predict 3 hours forward.
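As a minimal sketch of what a MIMO target with pred_len = 3 looks like (the window length and the toy series here are illustrative, not the project's actual data):

```python
import numpy as np

def make_mimo_windows(series, lookback, pred_len=3):
    """Turn a 1-D series into (X, y) pairs where each y holds the
    next `pred_len` values, predicted jointly (MIMO strategy)."""
    X, y = [], []
    for i in range(len(series) - lookback - pred_len + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + pred_len])
    return np.array(X), np.array(y)

series = np.arange(10.0)
X, y = make_mimo_windows(series, lookback=4, pred_len=3)
print(X.shape, y.shape)  # (4, 4) (4, 3)
```

Each training sample thus carries all 3 future hours at once, instead of chaining three single-step predictions.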
Score (RMSE) | estimators | max depth | min samples split | max features | max train size | test size | covid? | Notes |
---|---|---|---|---|---|---|---|---|
106.061 | 300 | 15 | 2 | 0.5 | unlimited | unlimited | No | Grid search result |
106.423 | 150 | 15 | 2 | 0.5 | 33.3% | unlimited | No | |
106.027 | 150 | 15 | 2 | 0.5 | 50% | unlimited | Yes | |
127.438 | 150 | 15 | 2 | 0.5 | 50% | 2 months | Yes | |
119.591 | 150 | 25 | 2 | 0.5 | 33.3% | 6 months | Yes | |
103.296 | 150 | 50 | unset | 0.75 | unlimited | unlimited | No | Best so far |
Generally speaking, the 2nd fold performs best. For example, on the last row of the table:
- 2nd split ran with a test error of 69.6
- 5th and 6th split ran with a test error of approx. 129.7
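The best row above can be sketched with scikit-learn as follows (synthetic data stands in for the real features; leaving `min_samples_split` at its default corresponds to "unset", and no `max_train_size`/`test_size` corresponds to "unlimited"):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))                  # placeholder features
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=600)

# 6 time-series splits, as everywhere on this page
tscv = TimeSeriesSplit(n_splits=6)
model = RandomForestRegressor(
    n_estimators=150, max_depth=50, max_features=0.75, random_state=0
)

rmses = []
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
print([round(r, 3) for r in rmses])            # one RMSE per fold
```

Printing the per-fold RMSEs is what makes observations like "the 2nd fold performs best" visible.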
Score (RMSE) | epochs | batch size | lr | model | model dropout | Notes |
---|---|---|---|---|---|---|
177.035 | 200 | 64 | 0.0005 | CNLessPad | 0.5 | |
169.244 | 200 | 64 | 0.001 | CNLong | 0.5 | noise, t-48 lookback |
148.006 | 400 | 2048 | 0.001 | CNLong | 0.5 | Final, longer early stop |
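The "longer early stop" note presumably refers to a larger early-stopping patience. A generic sketch of patience-based early stopping (the patience value and the `step` callback are illustrative, not the project's actual training loop):

```python
def train_with_early_stopping(step, max_epochs=1000, patience=50):
    """Run `step()` (which returns a validation loss) until the loss
    hasn't improved for `patience` epochs; return best loss and epoch."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        val_loss = step()
        if val_loss < best:
            best, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best, best_epoch
```

With a cap of 1000 epochs and a generous patience, the effective training length is decided by the validation loss rather than the epoch count, which is why epochs are not reported in the tables below.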
Score (RMSE) | batch_size | lr | hidden_size | num_layers | dropout | noise | bidirectional | Notes |
---|---|---|---|---|---|---|---|---|
124.755 | 128 | 0.001 | 15 | 2 | 0.0 | 0.0 | True | Initial grid search |
109.981* | 128 | 0.001 | 20 | 3 | 0.3 | 0.0 | True | |
109.007* | 128 | 0.001 | 20 | 2 | 0.3 | 0.0 | True | |
96.847* | 128 | 0.0001 | 20 | 3 | 0.3 | 0.0 | True | Lr tweaking |
94.576* | 128 | 0.0001 | 20 | 3 | 0.3 | 0.05 | True | |
93.393* | 2048 | 0.001 | 20 | 3 | 0.3 | 0.05 | True | Final |
* Score displayed is without the first split, since the LSTM model fit very poorly on a low number of data points.
Epochs aren't displayed (they were set to 1000 since I used early stopping); the best model fit in under 300 epochs. The number of time-series splits used is always 6.
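A minimal PyTorch sketch of the final row's configuration (the input size, window length, and output head are placeholders; the project's actual model definition lives in the notebooks):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Bidirectional LSTM mapping a feature window to pred_len outputs."""
    def __init__(self, n_features, hidden_size=20, num_layers=3,
                 dropout=0.3, pred_len=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers,
                            batch_first=True, dropout=dropout,
                            bidirectional=True)
        # *2 because bidirectional concatenates both directions
        self.head = nn.Linear(hidden_size * 2, pred_len)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # last timestep -> (batch, pred_len)

model = LSTMForecaster(n_features=11)
x = torch.randn(4, 24, 11)                # batch of 4, assumed 24-hour window
print(model(x).shape)                     # torch.Size([4, 3])
```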
Score (RMSE) | batch_size | lr | num_channels (nc) | kernel_size | dropout | noise | Notes |
---|---|---|---|---|---|---|---|
97.900 | 128 | 0.0001 | (72,) * 4 | 5 | 0.3 | 0.0 | Initial grid search |
101.900 | 128 | 0.0001 | (100,) * 4 | 9 | 0.3 | 0.0 | dropout and kernel_size tweaking |
99.457 | 128 | 0.0001 | (72,) * 4 | 5 | 0.3 | 0.05 | Noise test |
97.350 | 2048 | 0.001 | (72,) * 4 | 5 | 0.3 | 0.05 | Final |
Used a t-48 lookback, since regular CNNs benefitted from it.
Epochs aren't displayed (they were set to 1000 since I used early stopping); the best model fit in under 300 epochs.
This model is less consistent than the LSTM, but can provide better scores in certain cases, similar to the Random Forest model. Noise helps to somewhat stabilize this behaviour.
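The `noise` column in these tables refers to Gaussian noise added to the inputs during training. A minimal sketch of that augmentation (the 0.05 value matches the tables; the train-only gating is an assumption about how it would typically be applied):

```python
import torch

def add_input_noise(x, std=0.05, training=True):
    """Add zero-mean Gaussian noise to the inputs during training only,
    a light augmentation that regularizes the model."""
    if not training or std == 0.0:
        return x
    return x + torch.randn_like(x) * std

x = torch.zeros(2, 5)
noisy = add_input_noise(x, std=0.05)                    # perturbed copy
clean = add_input_noise(x, std=0.05, training=False)    # untouched at eval
```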
Score (RMSE) | batch_size | lr | embedding_size | num_layers | bidirectional | noise | Notes |
---|---|---|---|---|---|---|---|
111.890* | 128 | 0.0005 | 24 | 1 | True | 0.0 | Initial |
94.367* | 128 | 0.0005 | 12 | 1 | True | 0.0 | Smaller embedding |
93.729* | 128 | 0.0005 | 10 | 1 | True | 0.0 | |
88.600* | 2048 | 0.001 | 10 | 1 | True | 0.05 | |
* Score displayed is without the first split, since the GRU encoder-decoder model fit poorly on a low number of data points.
Dropout was always set to 0.5. Epochs aren't displayed (they were set to 1000 since I used early stopping); the best model fit in under 300 epochs.
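One common way to structure a GRU encoder-decoder like the one tuned above, sketched in PyTorch (the decoder's step-by-step unrolling and the dummy first input are assumptions; the project's actual model definition is in the notebooks):

```python
import torch
import torch.nn as nn

class GRUEncoderDecoder(nn.Module):
    """Encoder compresses the input window; decoder unrolls pred_len steps."""
    def __init__(self, n_features, embedding_size=10, pred_len=3, dropout=0.5):
        super().__init__()
        self.encoder = nn.GRU(n_features, embedding_size, batch_first=True,
                              bidirectional=True)
        self.decoder = nn.GRU(1, embedding_size * 2, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Linear(embedding_size * 2, 1)
        self.pred_len = pred_len

    def forward(self, x):                       # x: (batch, seq, n_features)
        _, h = self.encoder(x)                  # h: (2, batch, embedding)
        # merge the two directions into the decoder's initial hidden state
        h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
        step = torch.zeros(x.size(0), 1, 1)     # dummy first decoder input
        outputs = []
        for _ in range(self.pred_len):          # unroll one step at a time
            out, h = self.decoder(step, h)
            step = self.head(self.drop(out))
            outputs.append(step)
        return torch.cat(outputs, dim=1).squeeze(-1)

model = GRUEncoderDecoder(n_features=11)
out = model(torch.randn(4, 24, 11))
print(out.shape)                                # torch.Size([4, 3])
```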
Score (RMSE) | hidden_size | num_layers | dropout | noise | bidirectional | Notes |
---|---|---|---|---|---|---|
147.854* | 40 | 3 | 0.5 | 0.05 | True | Initial grid search |
133.181* | 60 | 3 | 0.3 | 0.0 | True | |
123.977* | 70 | 4 | 0.3 | 0.0 | True | |
128.038* | 80 | 4 | 0.3 | 0.0 | True | |
* Score displayed is without the first split, since the GRU model fit poorly on a low number of data points.
An LSTM model was also tested, but the GRU came out on top. pred_len is set to the feature count here (11). Batch size is 2048, learning rate is 0.001.
I grid searched small configurations of CNN, TCN, LSTM, and GRU models for the prec and grad features. I'll list the best model and parameters for both.
prec, TCN
Score (RMSE) | batch_size | lr | num_channels (nc) | kernel_size | dropout | noise |
---|---|---|---|---|---|---|
0.105 | 2048 | 0.001 | (32,) * 2 | 5 | 0.5 | 0.05 |
grad, CNN
Feature | Score (RMSE) | batch_size | lr | channels | kernel_sizes | dropout | noise |
---|---|---|---|---|---|---|---|
grad | 15.851 | 2048 | 0.0005 | (16, 32) | (6, 12) | 0.5 | 0.05 |
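A sketch of what a small two-block CNN with channels (16, 32) and kernel sizes (6, 12) could look like (the single input feature, pooling, and output head are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two Conv1d blocks with the channel/kernel sizes from the table."""
    def __init__(self, n_features, channels=(16, 32), kernel_sizes=(6, 12),
                 dropout=0.5, pred_len=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, channels[0], kernel_sizes[0]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(channels[0], channels[1], kernel_sizes[1]),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse the time axis
        )
        self.head = nn.Linear(channels[1], pred_len)

    def forward(self, x):                      # x: (batch, seq, n_features)
        x = x.transpose(1, 2)                  # Conv1d wants (batch, ch, seq)
        return self.head(self.net(x).squeeze(-1))

model = SmallCNN(n_features=1)
out = model(torch.randn(4, 48, 1))             # assumed t-48 window
print(out.shape)                               # torch.Size([4, 3])
```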
For el_load, I started by optimizing the 1-hour prediction (pred_len = 1). The GRU model outperformed the LSTM here too.
Score (RMSE) | hidden_size | num_layers | dropout | noise | bidirectional | Notes |
---|---|---|---|---|---|---|
67.627* | 25 | 2 | 0.5 | 0.05 | True | Multi-layer |
59.354* | 40 | 1 | 0.3 | 0.0 | True | Single-layer |
* Score displayed is without the first split, since the GRU and LSTM models fit poorly on a low number of data points.
I decided to test both models further, on the assumption that multi-layer models might handle the noise introduced by recursive predictions better. The next table shows the recursive predictions with everything combined; for the model definitions, refer to the notebooks. The only model being optimized at this point is the GRU (pred_len = 3).
Score (RMSE) | hidden_size | num_layers | dropout | noise | bidirectional | Notes |
---|---|---|---|---|---|---|
92.993* | 30 | 2 | 0.5 | 0.05 | True | Multi-layer |
93.265* | 50 | 1 | 0.3 | 0.0 | True | Single-layer |
Both models performed close to each other, so I'll be taking both to the final evaluation.
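The recursive strategy above can be sketched as follows: a one-step model predicts all features (hence pred_len = feature count), the prediction is appended to the input window, and the process repeats for the 3-hour horizon. The `predict_step` callable here is a toy stand-in for the trained GRU:

```python
import numpy as np

def recursive_forecast(window, predict_step, horizon=3):
    """Roll a one-step-ahead model forward `horizon` steps by feeding
    each prediction (all features) back into the input window."""
    window = window.copy()
    preds = []
    for _ in range(horizon):
        next_step = predict_step(window)             # shape: (n_features,)
        preds.append(next_step)
        window = np.vstack([window[1:], next_step])  # slide the window
    return np.array(preds)

# toy stand-in model: "predict" the mean of the window for every feature
window = np.ones((24, 11))                           # 24 hours, 11 features
preds = recursive_forecast(window, lambda w: w.mean(axis=0))
print(preds.shape)                                   # (3, 11)
```

Because each predicted step becomes part of the next input, errors compound over the horizon, which is the noise the multi-layer variant was hoped to handle better.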