Why is output vocab including positional embeddings? #47
Comments
Hi, in the article the authors use the transpose of the embedding matrix as a linear layer just before the softmax. This explains the shape of the softmax layer.
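For reference, a minimal sketch of that weight tying (the names and sizes here are illustrative, not the repo's exact code): because the decoder shares its weight with the full embedding matrix, the logits necessarily cover every row of it, positional rows included.

```python
import torch
import torch.nn as nn

n_vocab, n_special, n_ctx, n_embd = 40478, 3, 77, 768   # illustrative sizes
total_embeddings = n_vocab + n_special + n_ctx

# Token, special, and positional embeddings all live in one matrix
# of shape (total_embeddings, n_embd).
embed = nn.Embedding(total_embeddings, n_embd)

# Tying the LM head to the full embedding matrix means the decoder
# produces one logit per row of that matrix, positional rows included.
decoder = nn.Linear(n_embd, total_embeddings, bias=False)
decoder.weight = embed.weight          # shared parameter, i.e. logits = h @ embed.weight.T

h = torch.randn(2, 5, n_embd)          # hidden states: (batch, seq_len, n_embd)
lm_logits = decoder(h)
print(lm_logits.shape)                 # (2, 5, n_vocab + n_special + n_ctx)
```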
Thanks for the answer, @rodgzilla, but that shouldn't be the main reason. To tie the weights, we could use only the sub-part of the embedding matrix that corresponds to the n_vocab tokens. We never want to output positional tokens, yet a quick check shows that the model puts significant probability on them:
output1 is the output of the LMHead when the loaded pretrained model is run on the sentence "An apple is a fruit" with n_ctx=64.

The problem is that adding these n_ctx logits to the output vocab creates an incorrect dependency on n_ctx. This is also why I get different results when setting n_ctx to different, larger values. For example, the default is n_ctx=77 in https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py#L214, and when I try larger values the results change: between n_ctx=200, n_ctx=100, and the unmodified n_ctx=77 the results differ by almost 1% on the validation set and 0.58% on the test set. Running twice with the same n_ctx gives the same result, so the differences don't seem to come from any source other than n_ctx.

I've reported this separately in #45 (comment), and I believe it is due to the output vocabulary containing the positional embeddings. I will check soon whether taking the subset of the matrix solves the n_ctx dependency and let you know.

Best,
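A rough sketch of the fix being proposed here, continuing the illustrative names from the earlier snippet (not the repo's actual code): keep the tied weights but only decode into the real vocabulary, so the softmax no longer depends on n_ctx.

```python
import torch.nn.functional as F

# Option 1: keep the existing head and simply drop the positional
# logits before the softmax, so no probability mass can land on them.
vocab_logits = lm_logits[..., :n_vocab + n_special]
probs = F.softmax(vocab_logits, dim=-1)

# Option 2: decode with only the token/special sub-matrix of the tied
# embedding weights, which keeps the gradient path through the shared embeddings.
vocab_logits = F.linear(h, embed.weight[:n_vocab + n_special])
```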
Hi,
I was wondering why the output softmax has dimension n_vocab + n_special + n_ctx as opposed to just n_vocab + n_special. We don't really need to output "tokens" from the positional encodings, do we? I also had a look at some outputs, and the values on the last n_ctx lm_logits are not negligible. Thanks!
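One way to eyeball that last claim, again continuing the illustrative snippet above rather than the repo's real forward pass: sum the softmax mass that lands on the last n_ctx slots.

```python
# Probability mass the model assigns to the positional "tokens".
full_probs = torch.softmax(lm_logits, dim=-1)             # over all n_vocab + n_special + n_ctx slots
positional_mass = full_probs[..., -n_ctx:].sum(dim=-1)    # shape (batch, seq_len)
print(positional_mass)   # with a trained model, values well above zero would confirm the issue
```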