Train loss: very high volatility in loss #13

Open · pnaclcu opened this issue Mar 30, 2023 · 8 comments

Comments

pnaclcu commented Mar 30, 2023

Hello, thanks for your code. It is elegant and clear, and it has helped me a lot.
I ran into a problem: the training loss performs very well, around 0.001, at the beginning of training.
The default end epoch is set to 10000, but the training loss jumps to a surprising value of about "Training Loss: 325440.0592" after 2000+ epochs. I am curious: have you ever encountered this issue before?
The training batch size is 96 with 4 GPUs using PyTorch DDP. Since the full training set only contains about 4000 images, the 4 GPUs need only about 10 iterations to finish an epoch. Do you think this could be the reason?
Thanks again for your code.

@yuan5828225

I have the same problem. Have you solved the problem?

@ElvinChan777

> I have the same problem. Have you solved the problem?

Hi bro, have you solved the problem?

@yuan5828225

> > I have the same problem. Have you solved the problem?
>
> Hi bro, have you solved the problem?

Not yet. Testing with the checkpoint from the 10,000th epoch gives very poor results. Using the checkpoint with the lowest loss before the fluctuation is OK, although the result still isn't good, probably because my dataset is small. I am trying to tune the parameters.


Alan-Py commented Apr 15, 2023

Hey guys, have you solved the problem?


pnaclcu commented Apr 15, 2023

> > > I have the same problem. Have you solved the problem?
> >
> > Hi bro, have you solved the problem?
>
> Not yet. Testing with the checkpoint from the 10,000th epoch gives very poor results. Using the checkpoint with the lowest loss before the fluctuation is OK, although the result still isn't good, probably because my dataset is small. I am trying to tune the parameters.

Hey guys, I found a solution: add a scheduler to control the learning rate, e.g.

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.2, patience=50, verbose=True, min_lr=1e-6)
scheduler.step(THE LOSS YOU DEFINED)

Note that epoch_loss in driver.py seems to actually hold the batch loss, so I also rewrote the loss computation.
gl ^^
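
Roughly what I mean, as a minimal sketch: accumulate the loss over the whole epoch, average it, and feed that average to ReduceLROnPlateau. The model, data, and names below are placeholders, not the repo's actual driver.py:

```python
import torch
from torch import nn

# Hypothetical stand-ins for the model and data in driver.py (assumptions, not the repo's code)
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
criterion = nn.MSELoss()
images = torch.randn(8, 3, 64, 64)
masks = torch.randn(8, 1, 64, 64)
train_loader = [(images, masks)]  # one dummy batch per "epoch"

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.2, patience=50, min_lr=1e-6)

for epoch in range(100):
    running_loss, num_batches = 0.0, 0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        num_batches += 1

    # Average over the whole epoch instead of reporting the last batch's loss,
    # then let ReduceLROnPlateau react to this epoch-level metric.
    epoch_loss = running_loss / max(num_batches, 1)
    scheduler.step(epoch_loss)
```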


Alan-Py commented Apr 15, 2023

@pnaclcu Good job! Can you share your loss code?

@nhthanh0809

Hi bros,
My loss after each epoch is NaN (the loss value becomes NaN after some batches).
I checked the input data (images and masks), but there is no problem with the data.
Does anyone have the same problem as me?

@ChenqinWu

We can set args.scale_lr to False to solve this problem.
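
For context, a scale_lr flag usually means multiplying the base learning rate by the number of DDP processes. A minimal sketch of that pattern follows; the flag definition and names here are assumptions, not the repo's actual code:

```python
import argparse
import torch.distributed as dist

# Hypothetical argparse flag; the real definition lives in the repo's driver.py.
parser = argparse.ArgumentParser()
parser.add_argument('--scale_lr', action='store_true',
                    help='scale the base LR by the number of DDP processes')
args = parser.parse_args([])  # empty list -> scale_lr defaults to False

base_lr = 1e-3
world_size = dist.get_world_size() if dist.is_initialized() else 1

# With 4 GPUs this would multiply the LR by 4, which can make an already
# unstable run diverge; leaving scale_lr off keeps the base LR unchanged.
lr = base_lr * world_size if args.scale_lr else base_lr
print(lr)
```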
