Validation loss jumps on an epoch boundary? #19121
Replies: 2 comments
-
Hello, have you found the culprit of this behaviour?
-
This is usually accompanied by the same "staircasing" effect in the training loss, but in the opposite (downward) direction. Apparently turning off the per-epoch re-shuffling of the training data can eliminate the effect, though it's unclear why; a sketch of what that looks like is below.
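For anyone wanting to try that, here is a minimal sketch of disabling the per-epoch re-shuffle, assuming the training data is served through a standard `DataLoader` inside a `LightningDataModule` (the class name, dataset arguments, and batch size are placeholders, not from the original post):

```python
import lightning.pytorch as pl
from torch.utils.data import DataLoader


class FinetuneDataModule(pl.LightningDataModule):
    """Placeholder datamodule; only the dataloader wiring matters here."""

    def __init__(self, train_dataset, val_dataset, batch_size=8):
        super().__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # shuffle=False keeps the sample order identical across epochs.
        # With shuffle=True (the usual setting), a new random permutation is
        # drawn at every epoch boundary, which is the re-shuffling referred to above.
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=False)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)
```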
-
I'm using Lightning to do some LLM finetuning and have run into an issue where the validation loss quickly increases immediately after epoch boundaries. Wondering if anyone has seen an issue similar to this before.
The problem is shown in the following two images:
- Image 1 - the epoch size in this case is 3000 steps, set using `limit_train_batches` (a sketch of this setup is shown after the list). The entire training dataset is larger than this. There is a clear discontinuity in the validation loss at 6k and 9k steps, which are the beginnings of epochs 2 and 3.
- Image 2 - the epoch size in this case is roughly 6,200 steps. We can see a strange staircasing pattern where the validation loss jumps at the beginning of an epoch, improves as training occurs during the epoch, and finally jumps up again at the beginning of the next epoch.
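For reference, a minimal sketch of the kind of Trainer configuration described for Image 1; only `limit_train_batches=3000` comes from the post, the other values and names are assumptions:

```python
import lightning.pytorch as pl

trainer = pl.Trainer(
    max_epochs=3,
    limit_train_batches=3000,  # caps each epoch at 3000 training batches (Image 1)
    val_check_interval=1000,   # how often validation runs within an epoch (assumed value)
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule come from the finetuning setup
```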
Some other observations:
- `detect_anomaly` was enabled and did not trigger an error.

If anyone has seen something similar, or has suggestions on what could be causing this, I'm very open to comments. Thanks!