Hi,
Thanks for the well-written code!
I was wondering if you've explored the impact of the number of layers in the position-wise MLP in the transformer block. If I'm not mistaken, most implementations I've seen (like https://github.com/kimiyoung/transformer-xl/tree/master, which is cited in Stabilizing Transformers for RL: https://arxiv.org/abs/1910.06764), and even the original transformer paper, use an MLP with two layers and a ReLU between them.
So I was wondering whether your choice of a single layer followed by a ReLU (in TransformerBlock: self.fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())) is based on empirical tests you've done or on work I'm not aware of?
I'm not aware of any work that studies the impact of the architecture of the position-wise MLP in the transformer block, which I guess might be hard to do properly, since, for example, adding a layer changes the total number of parameters.
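For clarity, here is a minimal sketch contrasting the two variants (assuming PyTorch and the `embed_dim` name from TransformerBlock; the 4x inner dimension just follows the original paper's d_ff = 4 * d_model convention and is not taken from this repo):

```python
import torch.nn as nn

embed_dim = 512  # hypothetical value for illustration

# Single-layer variant as used in this repo's TransformerBlock:
fc_single = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.ReLU(),
)

# Standard position-wise feed-forward network from "Attention Is All You Need"
# (and Transformer-XL): two linear layers with a ReLU in between, where the
# inner dimension is typically 4 * embed_dim.
fc_two_layer = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
)
```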
It looks like we overlooked this detail. I'm not sure whether the missing layer will boost performance, but I'll try to test this sometime later this year.