
Question: Number of layers of the position wise MLP in transformer block. #18

Open
Reytuag opened this issue Nov 22, 2023 · 1 comment

Comments

@Reytuag
Contributor

Reytuag commented Nov 22, 2023

Hi,
Thanks for the well-written code!
I was wondering if you've explored the impact of the number of layers in the position-wise MLP in the transformer block. If I'm not mistaken, most implementations I've seen (e.g., https://github.com/kimiyoung/transformer-xl/tree/master, which is cited in Stabilizing Transformers for RL: https://arxiv.org/abs/1910.06764), and even the original Transformer paper, use an MLP with two layers and a ReLU between them.

So I was wondering whether your choice of a single layer followed by a ReLU (in TransformerBlock: self.fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())) is based on empirical tests you've run or on work I'm not aware of.
I'm not aware of any work that studies the impact of the position-wise MLP's architecture in the transformer block, which I guess might be hard to do properly since, for example, adding a layer changes the total number of parameters.
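For reference, here is a minimal sketch of the two-layer position-wise feed-forward from the original Transformer paper, for comparison with the repo's single-layer variant. The class name `PositionWiseFeedForward` and the 4x hidden expansion are just the paper's convention, not something taken from this repository:

```python
import torch
import torch.nn as nn


class PositionWiseFeedForward(nn.Module):
    """Two-layer position-wise MLP as in 'Attention Is All You Need'.

    The repo currently uses a single layer:
        self.fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())
    The variant below expands to a hidden dimension (4 * embed_dim by default
    in the original paper), applies ReLU, then projects back to embed_dim.
    """

    def __init__(self, embed_dim: int, hidden_dim: int = None):
        super().__init__()
        hidden_dim = hidden_dim if hidden_dim is not None else 4 * embed_dim
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),  # expand
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),  # project back to embed_dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


# Quick shape check: output shape matches input shape.
x = torch.randn(2, 16, 64)  # (batch, sequence, embed_dim)
ffn = PositionWiseFeedForward(64)
assert ffn(x).shape == x.shape
```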

@MarcoMeter
Owner

Thanks @Reytuag for bringing up your question!

It looks like we overlooked this detail. I'm not sure whether the missing layer will boost performance, but I'll try to test this sometime later this year.
