
Why does the output vocab include positional embeddings? #47

OanaMariaCamburu opened this issue Nov 21, 2018 · 2 comments

@OanaMariaCamburu

Hi,

I was wondering why the output softmax is of dimension n_vocab + n_special + n_ctx rather than just n_vocab + n_special. We don't really need to output "tokens" for the positional encodings, do we? I also had a look at some outputs, and the values of the last n_ctx lm_logits are not negligible. Thanks!
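
For reference, the kind of quick check I mean looks roughly like this (a sketch only; lm_logits stands for the LMHead output tensor, and the tensor below is a random stand-in with made-up sizes):

import torch

def positional_prob_mass(lm_logits: torch.Tensor, n_ctx: int) -> torch.Tensor:
    # Per-position probability mass assigned to the last n_ctx (positional) entries.
    probs = torch.softmax(lm_logits, dim=-1)
    return probs[..., -n_ctx:].sum(dim=-1)

# Stand-in tensor with the same layout as the LMHead output: (batch, seq, n_vocab + n_special + n_ctx).
# With the real model, this sum should be ~0 if the positional logits were negligible.
dummy_logits = torch.randn(1, 6, 40478 + 3 + 64)
print(positional_prob_mass(dummy_logits, n_ctx=64))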

@rodgzilla
Contributor

Hi,

In the paper, the authors use the transpose of the embedding matrix as the linear layer just before the softmax. Since that embedding matrix also contains the special and positional embeddings, this explains the shape of the softmax layer.
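
Roughly, the tying looks like this (a sketch, not the repo's exact code; the sizes are only for illustration):

import torch
import torch.nn as nn

n_vocab, n_special, n_ctx, n_embd = 40478, 3, 77, 768   # illustrative sizes
total_vocab = n_vocab + n_special + n_ctx

# Tokens, special symbols and positions all live in one embedding table.
embed = nn.Embedding(total_vocab, n_embd)

# Weight tying: the decoder is the transpose of the full embedding matrix,
# so the logits necessarily have total_vocab columns, positional rows included.
lm_head = nn.Linear(n_embd, total_vocab, bias=False)
lm_head.weight = embed.weight

hidden = torch.randn(1, 5, n_embd)   # stand-in for the transformer's final hidden states
print(lm_head(hidden).shape)         # torch.Size([1, 5, 40558])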

@OanaMariaCamburu
Author

Thanks for the answer, @rodgzilla, but that shouldn't be the main reason. To tie the weights, we could just use the sub-part of the embedding matrix that corresponds to the n_vocab tokens, since we never want to output positional tokens. Moreover, a quick check shows that the model is putting significant probability on them:

output1  tensor([[[ -6.4144,  -2.8540, -11.5565,  ...,   4.0194,   3.9853,   3.0146],
         [ -8.3058,  -7.6292, -19.2639,  ...,   1.0065,   1.1153,   0.5842],
         [ -7.3526,  -5.5124, -16.6829,  ...,   1.7537,   1.4622,   1.0649],
         ...,
         [  8.8013,   1.9527, -15.8394,  ...,   1.5357,   1.5170,   1.4211],
         [  8.7922,   1.9531, -15.8468,  ...,   1.5477,   1.5284,   1.4295],
         [  8.7885,   1.9515, -15.8496,  ...,   1.5536,   1.5351,   1.4346]]],
       device='cuda:0')

output1 is the output of the LMHead when the loaded pretrained model is run on the sentence "An apple is a fruit" with n_ctx=64.
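
To make the sub-matrix tying I have in mind concrete, it could look roughly like this (a sketch under assumed sizes, not a tested patch for the repo):

import torch
import torch.nn as nn

n_vocab, n_special, n_ctx, n_embd = 40478, 3, 77, 768   # illustrative sizes
embed = nn.Embedding(n_vocab + n_special + n_ctx, n_embd)

def lm_logits(hidden):
    # Functional tying through a slice: gradients still flow into the shared
    # embedding table, but the positional rows are excluded from the output,
    # so no positional "token" can ever be predicted.
    token_rows = embed.weight[: n_vocab + n_special]     # (n_vocab + n_special, n_embd)
    return hidden @ token_rows.t()

hidden = torch.randn(1, 5, n_embd)
print(lm_logits(hidden).shape)                           # torch.Size([1, 5, 40481])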

The problem is that adding these n_ctx logits to the output vocab creates an incorrect dependency on n_ctx. This is also a reason why I get different results when setting n_ctx to different, larger values. For example, n_ctx is set to 77 in https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/train.py#L214; when I try different values larger than that, I get different results. For example, with n_ctx=200 I get:
ROCStories Valid Accuracy: 91.18
ROCStories Test Accuracy: 86.10

while without modifying it (n_ctx=77) I get:
ROCStories Valid Accuracy: 90.37
ROCStories Test Accuracy: 86.00

or with n_ctx=100:
ROCStories Valid Accuracy: 90.11
ROCStories Test Accuracy: 86.58

That is almost a 1% difference on the validation set and 0.58% on the test set. Running twice with the same n_ctx gives the same result, so the differences don't seem to come from any source other than n_ctx. I've reported this separately in #45 (comment), and I believe it is due to the output vocabulary containing the positional embeddings.
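
A toy illustration of the mechanism I suspect (a sketch only, with made-up numbers): as long as the positional logits stay in the output vocabulary, enlarging n_ctx adds more terms to the softmax denominator, so any probability or LM loss computed from the logits changes with n_ctx even when the token logits themselves are identical.

import torch

token_logits = torch.tensor([2.0, 1.0, 0.5])             # fixed "real token" scores

for n_ctx in (77, 100, 200):
    positional_logits = torch.full((n_ctx,), 1.5)         # stand-in positional scores
    probs = torch.softmax(torch.cat([token_logits, positional_logits]), dim=0)
    # The probability of the same tokens shrinks as n_ctx grows, because the
    # extra positional entries soak up more of the softmax mass.
    print(n_ctx, probs[:3])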

I will soon check whether taking the subset of the matrix solves the n_ctx dependency problem and let you know.

Best,
Oana
