Validation loss jumps on an epoch boundary? #19121
Replies: 2 comments
-
Hello, have you found the culprit of this behaviour?
-
This is usually accompanied by the same "staircasing" effect in the training loss, but in the opposite (downward) direction. Apparently turning off the per-epoch re-shuffling of the training data can eliminate the effect, though it's unclear why; a sketch of what that looks like is below.
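For anyone wanting to try that, here is a minimal sketch of disabling the per-epoch re-shuffle, assuming the training data is served through a standard `DataLoader` inside a `LightningDataModule` (the class name, dataset arguments, and batch size are placeholders, not from the original post):

```python
import lightning.pytorch as pl
from torch.utils.data import DataLoader


class FinetuneDataModule(pl.LightningDataModule):
    """Placeholder datamodule; only the dataloader wiring matters here."""

    def __init__(self, train_dataset, val_dataset, batch_size=8):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # shuffle=False keeps the sample order identical across epochs.
        # With shuffle=True (the usual setting), a new random permutation is
        # drawn at every epoch boundary, which is the re-shuffling referred to above.
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=False)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)
```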
-
I'm using Lightning to do some LLM finetuning and have run into an issue where the validation loss quickly increases immediately after epoch boundaries. Wondering if anyone has seen an issue similar to this before.
The problem is shown in the following two images:
- Image 1 - the epoch size in this case is 3000 steps, set using `limit_train_batches` (a sketch of this setup is shown after the list). The entire training dataset is larger than this. There is a clear discontinuity in the validation loss at 6k and 9k steps, which are the beginnings of epochs 2 and 3.
- Image 2 - the epoch size in this case is roughly 6,200 steps. We can see a strange staircasing pattern where the validation loss jumps at the beginning of an epoch, improves as training occurs during the epoch, and finally jumps up again at the beginning of the next epoch.
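For reference, a minimal sketch of the kind of Trainer configuration described for Image 1; only `limit_train_batches=3000` comes from the post, the other values and names are assumptions:

```python
import lightning.pytorch as pl

trainer = pl.Trainer(
    max_epochs=3,
    limit_train_batches=3000,  # caps each epoch at 3000 training batches (Image 1)
    val_check_interval=1000,   # how often validation runs within an epoch (assumed value)
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule come from the finetuning setup
```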
Some other observations:
- `detect_anomaly` was enabled and did not trigger an error.

If anyone has seen something similar, or has suggestions on what could be causing this, I'm very open to comments. Thanks!