When using an imglist file to load data, if the number of dataset labels is not an integer multiple of the batch size, the loss becomes NaN #14822
Replies: 2 comments
-
@canteen-man Could you please provide some code with which we can reproduce your problem?
-
@lanking520 The training code is as follows:

```python
mod = mx.mod.Module(symbol=xxx, context=xxx, data_names=['xxx'], label_names=['xxx'])
mod.bind(data_shapes=training_iter.provide_data, label_shapes=training_iter.provide_label)
mod.init_params(mx.initializer.Xavier())
lr_sch = mx.lr_scheduler.FactorScheduler(step=2000, factor=0.5)
mod.fit(train_data=training_iter,
        optimizer='sgd',
        optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)),
        eval_metric='mse',
        num_epoch=500,
        epoch_end_callback=checkpoint)
```

I also printed the labels inside the fit function of base_module.py.
-
Description
I am doing regression, and the label is a single float (a width). I use an imglist to load my images and labels. When the total number of labels is not an integer multiple of the batch size, some labels take on strange values; I am not sure whether they are random numbers. When I print the labels inside the data iterator, some of them are very large and others very small, and I am certain these values are not the labels in my lst file. Because of these values, the loss becomes NaN.
I do not shuffle the labels or the images. I found that when the iterator reaches the end of the dataset, the first part of the last batch of the epoch contains the correct values; the second half of that batch should contain labels wrapped around from the head of the dataset, but instead it contains the strange values.
When I choose a batch size such that the total number of labels is an integer multiple of the batch size, the training loss becomes normal again.
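The behaviour I expected for the last partial batch (wrapping around to the head of the dataset) can be sketched in plain Python; this is only an illustration of the expectation, not MXNet's actual iterator code:

```python
def batches_with_rollover(labels, batch_size):
    """Yield full batches; the final partial batch wraps around to
    the head of the dataset instead of holding garbage values."""
    out = []
    for start in range(0, len(labels), batch_size):
        chunk = labels[start:start + batch_size]
        if len(chunk) < batch_size:                     # partial final batch
            chunk = chunk + labels[:batch_size - len(chunk)]
        out.append(chunk)
    return out

labels = list(range(10))                 # 10 labels, batch size 4
batches = batches_with_rollover(labels, 4)
# last batch wraps around: [8, 9, 0, 1]
```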
Looking through the code, I think the problem may be in detection.py: the next_sample and next functions are the ones that load from the imglist. It seems that a batch-label buffer of batch-size length is declared, and the labels from the imglist file are copied into it, advancing one batch-size step at a time. So the tail of the last (partial) batch never receives the correct labels.
I am not sure whether this is really the cause; the mistake seems too simple to be possible, but the problem does occur.
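The suspected cause can be illustrated with a small stand-alone sketch. This is my assumption of what happens, not MXNet's actual detection.py: a single preallocated buffer is reused for every batch and is never cleared, so the tail of a partial final batch keeps whatever it held before:

```python
import numpy as np

def batches_with_stale_buffer(labels, batch_size):
    """Yield label batches from one preallocated buffer.

    The buffer is NOT cleared between batches, so when the final
    batch is partial, its tail still holds values from the previous
    batch -- a plausible source of the "strange" labels.
    """
    buf = np.zeros(batch_size)           # preallocated batch-label buffer
    out = []
    for start in range(0, len(labels), batch_size):
        chunk = labels[start:start + batch_size]
        buf[:len(chunk)] = chunk         # only len(chunk) slots are refilled
        out.append(buf.copy())
    return out

labels = np.arange(10, dtype=float)      # 10 labels, batch size 4
batches = batches_with_stale_buffer(labels, 4)
# last batch: [8., 9., 6., 7.] -- the 6. and 7. are stale leftovers
```

In the real iterator the leftover memory would not be neat values from the previous batch, which would explain the very large and very small numbers observed.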
I have already worked around the problem by choosing a batch size that evenly divides the total number of labels in the dataset.
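An alternative to hand-picking the batch size is to trim the entry list so its length is a multiple of whatever batch size you want. This is a generic sketch, not an MXNet API; `entries` here is just a Python list standing in for the parsed .lst records:

```python
def trim_to_multiple(entries, batch_size):
    """Drop trailing entries so len(entries) % batch_size == 0."""
    usable = len(entries) - (len(entries) % batch_size)
    return entries[:usable]

entries = list(range(10))                # pretend 10 .lst entries
trimmed = trim_to_multiple(entries, 4)   # keeps the first 8 entries
```

Dropping at most batch_size - 1 samples per epoch is usually an acceptable price for avoiding the garbage labels in the final batch.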
Environment info (Required)
MXNet 1.2.1
CUDA 9.1
Ubuntu
Package used (Python/R/Scala/Julia):
Python 3.7