
differences vs. original paper #102

Open
soldierofhell opened this issue Nov 18, 2019 · 3 comments

@soldierofhell commented Nov 18, 2019

Hi @zzh8829,
Thanks for your work. I must admit I never dug into the original darknet implementation, but after going back to the original paper I noticed two apparent inconsistencies:

  1. box loss: in your code
     [image: box loss snippet from this repo's loss function, SSE on the sigmoid-activated x, y]
     According to the paper
     [image: Eq. 2: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h}]
     see
     [image: the paper's note that the gradient is the ground-truth value t̂_* minus the prediction t_*]

  2. scales of anchors: in your code all anchors are on the same 13x13 grid, but the model has 3 scales. Shouldn't we use more granular grids for the smaller anchors (similar to RetinaNet)?


Regards,

@zzh8829 (Owner) commented Dec 21, 2019

Hi, great to see you went deep into the paper.

  1. The implementation here is true to the paper author's original code. You can see here
    https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L141
    that he applied the sigmoid function to the x, y coordinates, and the loss function then uses the same x, y after the sigmoid:
    https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L98

I guess you can interpret this as: x, y are the result of a logistic regression on the center-coordinate offset. In section "4. Things We Tried That Didn't Work" the paper mentions that "Linear x, y predictions instead of logistic" decreased accuracy. That is the reason we use the SSE loss on the sigmoid outputs rather than on the raw (linear) predictions.
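
For illustration, a minimal TF2 sketch of the two variants (not the repo's actual loss code; the tensor names and shapes here are made up):

```python
import tensorflow as tf

# Hypothetical tensors for one 13x13 scale with 3 anchors per cell:
# raw_xy holds the network's unactivated t_x, t_y; true_xy holds the
# ground-truth center offsets within each grid cell, in [0, 1].
raw_xy = tf.random.normal([1, 13, 13, 3, 2])
true_xy = tf.random.uniform([1, 13, 13, 3, 2])

# Logistic x, y (what darknet and this repo do): squash the logits to
# (0, 1) so the center stays inside its cell, then sum-of-squared-error.
logistic_loss = tf.reduce_sum(tf.square(true_xy - tf.sigmoid(raw_xy)))

# Linear x, y (the variant the paper reports as hurting accuracy):
# regress the raw logits directly against the offsets.
linear_loss = tf.reduce_sum(tf.square(true_xy - raw_xy))
```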

  2. There are indeed 3 different scales in my code. The grid sizes are determined automatically from each conv layer's output size; you can see it here:

    grid_size *= 2

    Each of the 3 scales in fact has a different grid size.
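
As a toy illustration of that loop (assuming the default 416x416 input and a 32x stride at the coarsest scale):

```python
# The coarsest detection head downsamples 416 -> 13; each subsequent
# scale's feature map is twice as fine, which is exactly what
# grid_size *= 2 steps through: 13x13, 26x26, 52x52.
grid_size = 416 // 32
for scale in range(3):
    print(f"scale {scale}: {grid_size}x{grid_size} grid")
    grid_size *= 2
```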

@nicolefinnie commented:

@zzh8829 Speaking of which, two questions:

  • Is there a reason why you encoded the target as a single class index instead of one-hot, which is what the network outputs? To save memory? It only matters when calculating the loss function anyway, and it seems either way would work with sparse_categorical_crossentropy().

  • Also, I would encode the box target in the format t_x, t_y, t_w, t_h, as the paper suggests (the same as the network output), so there would be less code in the loss function. Is there a reason why you encoded the target in the original bbox format xmin, ymin, xmax, ymax and converted it in the loss function? That also differs from how the TensorFlow API orders boxes: ymin, xmin, ymax, xmax.

Thanks! :)

@zzh8829 (Owner) commented Dec 30, 2019

  1. Mathematically speaking they are the same formula, but as this post points out, https://datascience.stackexchange.com/a/41923 , when the categories are mutually exclusive, sparse crossentropy is much more performant (see the first sketch below).

  2. The encoding order of the targets doesn't really matter: the end result of training is the same, and the dataset is saved as tfrecord, which is indexed by name, not array order. The reason I didn't do t_x, t_y in the dataset transform is that t_* depends on the network output size and the anchor size, so it's more convenient to calculate them in the loss function (see the second sketch below).
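
On point 1, a quick sketch showing the two forms compute the same value (illustrative numbers, not from the repo):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # one prediction over 3 classes

# Sparse form: the target is a single integer class index.
sparse = tf.keras.losses.sparse_categorical_crossentropy(
    tf.constant([0]), logits, from_logits=True)

# Dense form: the same target as a one-hot vector.
dense = tf.keras.losses.categorical_crossentropy(
    tf.one_hot([0], depth=3), logits, from_logits=True)

print(sparse.numpy(), dense.numpy())  # same value; sparse skips the one-hot
```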
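
On point 2, a rough sketch of the conversion the loss function performs (a hypothetical helper, not the repo's exact code):

```python
import tensorflow as tf

def corners_to_targets(true_box, grid_size, anchors):
    """Convert (xmin, ymin, xmax, ymax) in [0, 1] into t_x, t_y, t_w, t_h.

    anchors: the matched anchor's (w, h), also normalized to [0, 1].
    """
    box_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2  # box center
    box_wh = true_box[..., 2:4] - true_box[..., 0:2]        # width, height

    # t_x, t_y: offset of the center within its grid cell.
    scaled = box_xy * grid_size
    t_xy = scaled - tf.floor(scaled)

    # t_w, t_h: log-ratio to the anchor, inverting b_w = p_w * exp(t_w).
    t_wh = tf.math.log(box_wh / anchors)
    return t_xy, t_wh
```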
