Training Model.fit - BaseCollectiveExecutor::StartAbort Out of range: End of sequence #144
Comments
Reading your tutorial, you encourage using --transfer=darknet as opposed to none. Does this mean I should avoid classes that overlap between my training dataset and Darknet, or will they merge?
The out-of-range error shouldn't affect training; see tensorflow/tensorflow#31509. Regarding transfer learning: as for your issue, it's hard to tell without context on your dataset. Are you training with raw COCO or with custom images? Can you share the number of images/classes in your dataset and your loss output?
Thanks. I understand it better now. So the Darknet weights are applied to the network before we begin training. Much clearer. I am running the same training data (COCO) with --transfer=darknet and --mode=eager_fit and I am not getting any errors at all. I am running 20 epochs, and it will be at least tomorrow before it finishes, so I don't know yet how well the results will work in detection. Afterwards I will try running with --mode=fit again, while continuing to transfer the Darknet weights.
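For context, --transfer=darknet roughly corresponds to something like the following minimal sketch. The yolov3_tf2.models import path and the 'yolo_darknet' backbone layer name are assumptions based on this repo's layout, not a quote of train.py:

```python
from yolov3_tf2.models import YoloV3  # repo module path, assumed

# Model being trained, plus a pretrained copy to transfer from.
model = YoloV3(416, training=True, classes=80)
pretrained = YoloV3(416, training=True, classes=80)
pretrained.load_weights('./checkpoints/yolov3.tf')

# Copy only the Darknet backbone weights, then freeze that layer so
# training starts from pretrained features instead of random weights.
model.get_layer('yolo_darknet').set_weights(
    pretrained.get_layer('yolo_darknet').get_weights())
model.get_layer('yolo_darknet').trainable = False
```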
Completed 20 epochs of training on the COCO data. No errors until the very end, at which point I got this ... When I try to run detect_video.py referencing the last-epoch checkpoint for weights, I get similar errors, such as: WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer Any idea what I am doing wrong?
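Those unresolved-object warnings usually mean the training checkpoint contains optimizer state that an inference model never restores. A minimal sketch of silencing them, assuming a TF-format checkpoint and the repo's YoloV3 model (the checkpoint path is illustrative):

```python
from yolov3_tf2.models import YoloV3  # repo module path, assumed

yolo = YoloV3(classes=80)
# A training checkpoint also stores optimizer slot variables; an
# inference model has no optimizer, so the restore is intentionally
# partial. expect_partial() suppresses the "Unresolved object" warnings.
status = yolo.load_weights('./checkpoints/yolov3_train_20.tf')  # illustrative path
status.expect_partial()
```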
I apologize, as I overlooked your requests. I will shrink the tfrecord, collect my loss output, and provide it for your reference as soon as I am back in the office.
Any update on this? One reason I thought this error could occur is that your total number of images is not divisible by your batch size. For example, if your dataset has 1005 images and a batch size of 8, the last batch will contain only 5 images. @zzh8829?
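If that divisibility hypothesis is the cause, one way to rule it out is to drop the short final batch. A sketch under that assumption (in train.py the batching happens after the records are parsed, but the idea is the same):

```python
import tensorflow as tf

batch_size = 8
# With e.g. 1005 images and a batch size of 8, the final batch holds
# only 5 images; drop_remainder=True discards it so every batch is full.
dataset = tf.data.TFRecordDataset('../COCO/images/train.tfrecord')
dataset = dataset.batch(batch_size, drop_remainder=True)
```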
This issue was posted and closed recently; however, there is no clear resolution for me to follow. I've tried training with VOC and COCO datasets compiled appropriately into TFRecord datasets. I've used the recently posted visualize tools (most helpful) to validate my TFRecord entries; a generic sanity check along those lines is sketched below.
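As a rough stand-in for such a check (this is not the repo's visualize tool; the tfrecord path matches the training command further down), the snippet counts the records and inspects the first example's feature keys:

```python
import tensorflow as tf

raw = tf.data.TFRecordDataset('../COCO/images/train.tfrecord')
print('records:', sum(1 for _ in raw))  # total number of examples

# Parse the first record to confirm it decodes as a tf.train.Example.
example = tf.train.Example()
example.ParseFromString(next(iter(raw)).numpy())
print(sorted(example.features.feature.keys()))
```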
I am running TF 2.0
CUDA 10.1.243
cuDNN 7.6.5
python 3.7.0
Ubuntu 18.04 LTS
NVIDIA TITAN RTX
python train.py --dataset=../COCO/images/train.tfrecord --val_dataset=../COCO/images/test.tfrecord --weights=./checkpoints/yolov3.tf --classes=../COCO/images/coco.names --mode=fit --transfer=darknet --epochs=2 --num_classes=80
I get these errors at the end of each epoch ...
tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
2020-01-05 18:27:37.863969: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[Shape/_10]]
2020-01-05 18:30:26.995630: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_2]]
2020-01-05 18:30:26.995689: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
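For what it's worth, tensorflow/tensorflow#31509 (linked above) suggests these end-of-sequence warnings are just the input iterator signalling the end of an epoch and are harmless. A self-contained sketch of one way to avoid them, using repeat() plus an explicit steps_per_epoch (the model, data, and counts here are toy stand-ins, not the repo's pipeline):

```python
import tensorflow as tf

# Toy stand-ins; in train.py these would be the YOLO model and the
# TFRecord input pipeline.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

num_examples, batch_size = 1005, 8
data = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([num_examples, 4]), tf.zeros([num_examples, 1])))

# repeat() makes the iterator infinite, so fit() never reaches
# end-of-sequence; steps_per_epoch bounds each epoch instead.
dataset = data.batch(batch_size).repeat()
model.fit(dataset, epochs=2, steps_per_epoch=num_examples // batch_size)
```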
My loss calculations look fine, but training results in no detections.
I've also tried --transfer=darknet and get the same errors, plus unresolved-object warnings such as ... WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
W0105 19:28:00.213585 139785603508032 util.py:144] Unresolved object in checkpoint: (root).layer-8
In the previous post the final comment was "random weights not recommended". I am not able to interpret what this means with respect to the training parameters or inputs. How can I solve this issue? Any thoughts on what could be the cause?