Question about attention sink arising in pretrained models #68

kevinli573 · 2023-11-17T02:42:06Z

Section 3.3 of the paper mentions that 160M-parameter models were pretrained. For these models, do you know roughly when during training (e.g., after how many steps or tokens) the attention sink phenomenon started to arise?

Thanks!
