Model Training Domain Knowledge
- Understand the meaning and implications of common configurations: batch size, sequence length, learning rate, weight decay, global norm, loss scale... (see the sketch after this list)
- Familiarize yourself with the common patterns in loss curves and learn to spot abnormal ones
- Understand the differences between common optimizers: SGD, Adam, and LAMB
- Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- Become familiar with running and monitoring AML experiments
- Familiarize yourself with setting up TensorBoard
- Action: submit a distributed training job to an AML cluster and get familiar with its user interface, logging, and available metrics
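Below is a minimal sketch of where these common configurations show up in a PyTorch-style training loop. The model, batch, and constant values are illustrative placeholders, not ORT's actual training API.

```python
# Minimal sketch: common training configurations in a PyTorch-style loop.
# The model and data below are hypothetical stand-ins.
import torch

BATCH_SIZE = 32          # examples processed per optimizer step (per GPU)
SEQ_LEN = 128            # tokens per example; memory/compute grow with it
LEARNING_RATE = 5e-5     # step size for parameter updates
WEIGHT_DECAY = 0.01      # L2-style regularization applied by the optimizer
MAX_GRAD_NORM = 1.0      # global-norm gradient clipping threshold
LOSS_SCALE = 1024.0      # static loss scale used in fp16 training (illustrative)

model = torch.nn.Linear(SEQ_LEN, 2)                # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=LEARNING_RATE,
                              weight_decay=WEIGHT_DECAY)

inputs = torch.randn(BATCH_SIZE, SEQ_LEN)          # fake batch
labels = torch.randint(0, 2, (BATCH_SIZE,))

logits = model(inputs)
loss = torch.nn.functional.cross_entropy(logits, labels)

# Loss scaling: scale up before backward, unscale gradients before the step.
(loss * LOSS_SCALE).backward()
for p in model.parameters():
    if p.grad is not None:
        p.grad /= LOSS_SCALE

# Global-norm gradient clipping.
torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)

optimizer.step()
optimizer.zero_grad()
```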
Remove all randomness in the program (a sketch follows this list):
- Set Seeds
- Set Dropout Ratio to 0
- Set use_deterministic_compute=True
- Disable dataloader shuffling
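A minimal sketch of removing randomness from a PyTorch-based training script. The ORT-specific use_deterministic_compute flag mentioned above is shown only as a comment, since its exact spelling depends on which ORT training options object your script uses.

```python
# Minimal sketch: make a PyTorch-based training script deterministic.
import os
import random

import numpy as np
import torch

SEED = 42

# 1. Set seeds for every RNG the script touches.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)

# Prefer deterministic kernels in PyTorch; the corresponding ORT option
# (use_deterministic_compute=True, as referenced above) is set in the
# trainer/session options of whichever ORT training API you use.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# 2. Set the dropout ratio to 0 so dropout masks cannot differ between runs.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.Dropout(p=0.0),   # hypothetical model; the point is p=0.0
    torch.nn.Linear(16, 2),
)

# 3. Disable dataloader shuffling so batches arrive in the same order.
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16),
                                         torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=False)
```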
Shrink the repro conditions to the minimum that still reproduces the issue (a sketch follows this list):
- Use a 1-layer model
- Use a smaller hidden_size
- Use a single GPU
- ...
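A minimal sketch of what a shrunk repro might look like. The config keys and the toy model are hypothetical stand-ins for whatever your script actually builds.

```python
# Minimal sketch: shrink the model and run on a single device.
import torch

tiny_config = {
    "num_layers": 1,      # 1-layer model
    "hidden_size": 64,    # much smaller hidden size
    "seq_len": 32,
    "batch_size": 2,
}

# Stand-in for "build the smallest model that still reproduces the issue".
model = torch.nn.Sequential(
    *[torch.nn.Linear(tiny_config["hidden_size"], tiny_config["hidden_size"])
      for _ in range(tiny_config["num_layers"])]
).to("cuda:0" if torch.cuda.is_available() else "cpu")  # single GPU (or CPU)
```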
Common Tricks
- Set the learning rate to 0 so that optimizer steps cannot change the model (see the sketch below)
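A minimal sketch of the lr=0 trick with a hypothetical toy model: with the learning rate set to 0, the optimizer step cannot change the weights, so any remaining difference between runs must come from somewhere other than the parameter updates.

```python
# Minimal sketch: verify that weights do not change when lr=0.
import copy
import torch

model = torch.nn.Linear(8, 2)                      # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.0)

before = copy.deepcopy(model.state_dict())

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

after = model.state_dict()
assert all(torch.equal(before[k], after[k]) for k in before), "weights changed!"
```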
Advanced: how to do hyper-parameter tuning to make the model converge better? (a sketch of a simple random search follows)
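A minimal sketch of a simple random search; train_and_eval is a hypothetical stand-in for a short training run that returns a validation loss.

```python
# Minimal sketch: random search over learning rate and weight decay.
import random

def train_and_eval(lr, weight_decay):
    # Placeholder: run a short training job and return validation loss.
    return (lr - 3e-4) ** 2 + weight_decay  # fake objective for illustration

best = None
for _ in range(20):
    lr = 10 ** random.uniform(-5, -3)          # sample lr log-uniformly
    wd = random.choice([0.0, 0.01, 0.1])       # weight decay candidates
    loss = train_and_eval(lr, wd)
    if best is None or loss < best[0]:
        best = (loss, lr, wd)

print("best (loss, lr, weight_decay):", best)
```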
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.