Model Training Domain Knowledge
- Understand the meaning and implications of common configurations: batch size, sequence length, learning rate, weight decay, global norm, loss scale... (see the sketch after this list)
- Familiarize yourself with the common patterns in loss curves and learn to spot abnormal ones
- Understand the differences between common optimizers: SGD, Adam, and LAMB
- Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- Become familiar with running and monitoring AML experiments
- Familiarize yourself with setting up TensorBoard
- Action: submit a distributed training job to an AML cluster and get familiar with its user interface, logging, and available metrics
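Below is a minimal sketch of where these common configurations show up in a PyTorch-style training loop. The model, batch, and constant values are illustrative placeholders, not ORT's actual training API.

```python
# Minimal sketch: common training configurations in a PyTorch-style loop.
# The model and data below are hypothetical stand-ins.
import torch

BATCH_SIZE = 32          # examples processed per optimizer step (per GPU)
SEQ_LEN = 128            # tokens per example; memory/compute grow with it
LEARNING_RATE = 5e-5     # step size for parameter updates
WEIGHT_DECAY = 0.01      # L2-style regularization applied by the optimizer
MAX_GRAD_NORM = 1.0      # global-norm gradient clipping threshold
LOSS_SCALE = 1024.0      # static loss scale used in fp16 training (illustrative)

model = torch.nn.Linear(SEQ_LEN, 2)                # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=LEARNING_RATE,
                              weight_decay=WEIGHT_DECAY)

inputs = torch.randn(BATCH_SIZE, SEQ_LEN)          # fake batch
labels = torch.randint(0, 2, (BATCH_SIZE,))

logits = model(inputs)
loss = torch.nn.functional.cross_entropy(logits, labels)

# Loss scaling: scale up before backward, unscale gradients before the step.
(loss * LOSS_SCALE).backward()
for p in model.parameters():
    if p.grad is not None:
        p.grad /= LOSS_SCALE

# Global-norm gradient clipping.
torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)

optimizer.step()
optimizer.zero_grad()
```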
Remove all randomness in the program (a sketch follows this list):
- Set Seeds
- Set Dropout Ratio to 0
- Set use_deterministic_compute=True
- Disable dataloader shuffling
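A minimal sketch of removing randomness from a PyTorch-based training script. The ORT-specific use_deterministic_compute flag mentioned above is shown only as a comment, since its exact spelling depends on which ORT training options object your script uses.

```python
# Minimal sketch: make a PyTorch-based training script deterministic.
import os
import random

import numpy as np
import torch

SEED = 42

# 1. Set seeds for every RNG the script touches.
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)

# Prefer deterministic kernels in PyTorch; the corresponding ORT option
# (use_deterministic_compute=True, as referenced above) is set in the
# trainer/session options of whichever ORT training API you use.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# 2. Set the dropout ratio to 0 so dropout masks cannot differ between runs.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.Dropout(p=0.0),   # hypothetical model; the point is p=0.0
    torch.nn.Linear(16, 2),
)

# 3. Disable dataloader shuffling so batches arrive in the same order.
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16),
                                         torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=False)
```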
Shrink the repro conditions to the minimum that still reproduces the issue (a sketch follows this list):
- Use a 1-layer model
- Use a smaller hidden_size
- Use a single GPU
- ...
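A minimal sketch of what a shrunk repro might look like. The config keys and the toy model are hypothetical stand-ins for whatever your script actually builds.

```python
# Minimal sketch: shrink the model and run on a single device.
import torch

tiny_config = {
    "num_layers": 1,      # 1-layer model
    "hidden_size": 64,    # much smaller hidden size
    "seq_len": 32,
    "batch_size": 2,
}

# Stand-in for "build the smallest model that still reproduces the issue".
model = torch.nn.Sequential(
    *[torch.nn.Linear(tiny_config["hidden_size"], tiny_config["hidden_size"])
      for _ in range(tiny_config["num_layers"])]
).to("cuda:0" if torch.cuda.is_available() else "cpu")  # single GPU (or CPU)
```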
Common Tricks
- Set the learning rate to 0 so that optimizer steps cannot change the model (see the sketch below)
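A minimal sketch of the lr=0 trick with a hypothetical toy model: with the learning rate set to 0, the optimizer step cannot change the weights, so any remaining difference between runs must come from somewhere other than the parameter updates.

```python
# Minimal sketch: verify that weights do not change when lr=0.
import copy
import torch

model = torch.nn.Linear(8, 2)                      # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.0)

before = copy.deepcopy(model.state_dict())

loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

after = model.state_dict()
assert all(torch.equal(before[k], after[k]) for k in before), "weights changed!"
```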
Advanced: how to do hyper-parameter tuning to make the model converge better? (a sketch of a simple random search follows)
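A minimal sketch of a simple random search; train_and_eval is a hypothetical stand-in for a short training run that returns a validation loss.

```python
# Minimal sketch: random search over learning rate and weight decay.
import random

def train_and_eval(lr, weight_decay):
    # Placeholder: run a short training job and return validation loss.
    return (lr - 3e-4) ** 2 + weight_decay  # fake objective for illustration

best = None
for _ in range(20):
    lr = 10 ** random.uniform(-5, -3)          # sample lr log-uniformly
    wd = random.choice([0.0, 0.01, 0.1])       # weight decay candidates
    loss = train_and_eval(lr, wd)
    if best is None or loss < best[0]:
        best = (loss, lr, wd)

print("best (loss, lr, weight_decay):", best)
```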
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.