
differences vs. original paper #102

Open
soldierofhell opened this issue Nov 18, 2019 · 3 comments

@soldierofhell commented Nov 18, 2019

Hi @zzh8829,
Thanks for your work. I must admit I never dug into the original darknet implementation, but after going back to the original paper I noticed two apparent inconsistencies:

  1. box loss: in your code
     [image: box loss snippet from this repo's loss function, SSE on the sigmoid-activated x, y]
     According to the paper
     [image: Eq. 2: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h}]
     see
     [image: the paper's note that the gradient is the ground-truth value t̂_* minus the prediction t_*]

  2. scales of anchors: in your code all anchors are on the same 13x13 grid, but the model has 3 scales. Shouldn't we use more granular grids for the smaller anchors (similar to RetinaNet)?


Regards,

@zzh8829 (Owner) commented Dec 21, 2019

Hi, great to see you went deep into the paper.

  1. The implementation here is true to the paper author's original code. You can see here
    https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L141
    that he applied the sigmoid function to the x, y coordinates, and the loss function then uses the same x, y after the sigmoid:
    https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L98

I guess you can interpret this as: x, y are the result of a logistic regression on the center-coordinate offset. In section "4. Things We Tried That Didn't Work" the paper mentions that "Linear x, y predictions instead of logistic" decreased accuracy. That is the reason we use the SSE loss on the sigmoid outputs rather than on the raw (linear) predictions.
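
For illustration, a minimal TF2 sketch of the two variants (not the repo's actual loss code; the tensor names and shapes here are made up):

```python
import tensorflow as tf

# Hypothetical tensors for one 13x13 scale with 3 anchors per cell:
# raw_xy holds the network's unactivated t_x, t_y; true_xy holds the
# ground-truth center offsets within each grid cell, in [0, 1].
raw_xy = tf.random.normal([1, 13, 13, 3, 2])
true_xy = tf.random.uniform([1, 13, 13, 3, 2])

# Logistic x, y (what darknet and this repo do): squash the logits to
# (0, 1) so the center stays inside its cell, then sum-of-squared-error.
logistic_loss = tf.reduce_sum(tf.square(true_xy - tf.sigmoid(raw_xy)))

# Linear x, y (the variant the paper reports as hurting accuracy):
# regress the raw logits directly against the offsets.
linear_loss = tf.reduce_sum(tf.square(true_xy - raw_xy))
```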

  2. There are indeed 3 different scales in my code. The grid sizes are determined automatically from each conv layer's output size; you can see it here:

    grid_size *= 2

    Each of the 3 scales in fact has a different grid size.
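
As a toy illustration of that loop (assuming the default 416x416 input and a 32x stride at the coarsest scale):

```python
# The coarsest detection head downsamples 416 -> 13; each subsequent
# scale's feature map is twice as fine, which is exactly what
# grid_size *= 2 steps through: 13x13, 26x26, 52x52.
grid_size = 416 // 32
for scale in range(3):
    print(f"scale {scale}: {grid_size}x{grid_size} grid")
    grid_size *= 2
```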

@nicolefinnie commented:

@zzh8829 Speaking of which, two questions:

  • Is there a reason why you encoded the target as a single class index instead of one-hot, which is what the network outputs? To save memory? It only matters when calculating the loss function anyway, and it seems either way would work with sparse_categorical_crossentropy().

  • Also, I would encode the box target in the format t_x, t_y, t_w, t_h, as the paper suggests (the same as the network output), so there would be less code in the loss function. Is there a reason why you encoded the target in the original bbox format xmin, ymin, xmax, ymax and converted it in the loss function? That also differs from how the TensorFlow API orders boxes: ymin, xmin, ymax, xmax.

Thanks! :)

@zzh8829 (Owner) commented Dec 30, 2019

  1. Mathematically speaking they are the same formula, but as this post points out, https://datascience.stackexchange.com/a/41923 , when the categories are mutually exclusive, sparse crossentropy is much more performant (see the first sketch below).

  2. The encoding order of the targets doesn't really matter: the end result of training is the same, and the dataset is saved as tfrecord, which is indexed by name, not array order. The reason I didn't do t_x, t_y in the dataset transform is that t_* depends on the network output size and the anchor size, so it's more convenient to calculate them in the loss function (see the second sketch below).
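
On point 1, a quick sketch showing the two forms compute the same value (illustrative numbers, not from the repo):

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])  # one prediction over 3 classes

# Sparse form: the target is a single integer class index.
sparse = tf.keras.losses.sparse_categorical_crossentropy(
    tf.constant([0]), logits, from_logits=True)

# Dense form: the same target as a one-hot vector.
dense = tf.keras.losses.categorical_crossentropy(
    tf.one_hot([0], depth=3), logits, from_logits=True)

print(sparse.numpy(), dense.numpy())  # same value; sparse skips the one-hot
```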
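
On point 2, a rough sketch of the conversion the loss function performs (a hypothetical helper, not the repo's exact code):

```python
import tensorflow as tf

def corners_to_targets(true_box, grid_size, anchors):
    """Convert (xmin, ymin, xmax, ymax) in [0, 1] into t_x, t_y, t_w, t_h.

    anchors: the matched anchor's (w, h), also normalized to [0, 1].
    """
    box_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2  # box center
    box_wh = true_box[..., 2:4] - true_box[..., 0:2]        # width, height

    # t_x, t_y: offset of the center within its grid cell.
    scaled = box_xy * grid_size
    t_xy = scaled - tf.floor(scaled)

    # t_w, t_h: log-ratio to the anchor, inverting b_w = p_w * exp(t_w).
    t_wh = tf.math.log(box_wh / anchors)
    return t_xy, t_wh
```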
