Hi,
Thanks for the well-written code!
I was wondering if you've explored the impact of the number of layers in the position-wise MLP in the transformer block. If I'm not mistaken, most implementations I've seen (like https://github.com/kimiyoung/transformer-xl/tree/master, which is cited in Stabilizing Transformers for RL: https://arxiv.org/abs/1910.06764), and even the original transformer paper, use an MLP with two layers and a ReLU between them.
So I was wondering whether your choice of a single layer followed by a ReLU (in TransformerBlock: self.fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())) is based on empirical tests you've done or on work I'm not aware of?
I'm not aware of any work that studies the impact of the architecture of the position-wise MLP in the transformer block, which I guess might be hard to do properly, since, for example, adding a layer changes the total number of parameters.
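For clarity, here is a minimal sketch contrasting the two variants (assuming PyTorch and the `embed_dim` name from TransformerBlock; the 4x inner dimension just follows the original paper's d_ff = 4 * d_model convention and is not taken from this repo):

```python
import torch.nn as nn

embed_dim = 512  # hypothetical value for illustration

# Single-layer variant as used in this repo's TransformerBlock:
fc_single = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.ReLU(),
)

# Standard position-wise feed-forward network from "Attention Is All You Need"
# (and Transformer-XL): two linear layers with a ReLU in between, where the
# inner dimension is typically 4 * embed_dim.
fc_two_layer = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
)
```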
It looks like we overlooked this detail. I'm not sure whether the missing layer will boost performance, but I'll try to test this sometime later this year.