GitHub - tarowatanabe/Transformer-Clinic: Understanding the Difficulty of Training Transformers

Admin

Understanding the Difficulty of Training Transformers

We are in an early-release beta. Expect some adventures and rough edges.

Introduction

What complicates Transformer training?

In our study, we go beyond gradient vanishing and identify an amplification effect that substantially influences Transformer training. Specifically, for each layer in a multi-layer Transformer, heavy dependency on its residual branch makes training unstable, yet light dependency leads to sub-optimal performance.

Dependency and Amplification Effect

Our analysis starts from the observation that Pre-LN is more robust than Post-LN, whereas Post-LN typically leads to better performance. As shown in Figure 1, we find that these two variants have different layer dependency patterns.
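For readers unfamiliar with the two variants, the following is a minimal NumPy sketch of where layer normalization sits in each. It is not taken from this repository: `sublayer` is a hypothetical stand-in for an attention or feed-forward block, and the layer norm omits the learned gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, weight):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return np.tanh(x @ weight)

def post_ln_layer(x, weight):
    # Post-LN: add the residual branch to the skip connection, then normalize the sum.
    return layer_norm(x + sublayer(x, weight))

def pre_ln_layer(x, weight):
    # Pre-LN: normalize only the branch input; the skip path is passed through unchanged.
    return x + sublayer(layer_norm(x), weight)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))          # 4 positions, hidden size 16 (illustrative sizes)
w = rng.standard_normal((16, 16)) / 4.0   # illustrative sublayer weight

post_out = post_ln_layer(x, w)
pre_out = pre_ln_layer(x, w)
print(post_out.shape, pre_out.shape)
```

In the Post-LN layer the skip connection and the branch are rescaled together, so the branch's share of the output is set by their relative magnitudes; in the Pre-LN layer the skip path accumulates unnormalized across layers.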

With further exploration, we find that for an N-layer residual network, after updating its parameters W to W*, the change in its output is proportional to its dependency on residual branches.

Intuitively, since a larger output change indicates a less smooth loss surface, heavy dependency complicates training. To address this, we propose Admin (adaptive model initialization), which starts training from an area with a smoother surface. More details can be found in our paper.
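The relation between dependency and output change can be illustrated with a toy single-layer experiment. This is only a sketch under simplifying assumptions, not the analysis from the paper: the skip connection and the residual branch are modeled as independent Gaussian vectors, a parameter update is modeled as random noise proportional to the branch magnitude, and `branch_scale`, `update_size`, and `output_change` are hypothetical names introduced for illustration.

```python
import numpy as np

def normalize(x):
    """Rescale a vector to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

def output_change(branch_scale, update_size=0.1, dim=4096, seed=0):
    """Mean squared change of a normalized layer output after a simulated update.

    branch_scale controls how strongly the output depends on the residual branch.
    """
    rng = np.random.default_rng(seed)
    skip = rng.standard_normal(dim)
    branch = branch_scale * rng.standard_normal(dim)
    # Assumption: the update perturbs the branch output in proportion to its magnitude.
    updated_branch = branch + update_size * branch_scale * rng.standard_normal(dim)
    before = normalize(skip + branch)
    after = normalize(skip + updated_branch)
    return float(np.mean((before - after) ** 2))

for scale in (0.1, 0.5, 1.0, 2.0):
    print(f"branch scale {scale}: output change {output_change(scale):.5f}")
```

In this toy setting the same relative update produces a larger output change as the branch contributes a larger share of the layer output.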

Quick Start Guide

Our implementation is based on the fairseq package. Please run the following commands to install:

git clone https://github.com/LiyuanLucasLiu/Transformer-Clinic.git
cd Transformer-Clinic/fairseq
pip install --editable .

The guidance for reproducing our results is available at:

Specifically, our implementation requires two stages: first set --init-type adaptive-profiling and use one GPU for the profiling stage, then set --init-type adaptive and start training.
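The two stages can be sketched as a small Python helper that assembles one command per stage. Only the `--init-type` values come from the text above; the `fairseq-train` entry point is assumed to be the one installed by fairseq, `data-bin/my-dataset` is a hypothetical placeholder path, and every other flag a real run needs (architecture, optimizer, and so on) is deliberately left out. Restricting the profiling stage with `CUDA_VISIBLE_DEVICES=0` is one common way to use a single GPU, not necessarily the authors' method.

```python
import shlex

VALID_INIT_TYPES = ("adaptive-profiling", "adaptive")

def build_stage_command(data_dir, init_type, extra_args=()):
    """Return the argument list for one stage; data_dir and extra_args are placeholders."""
    if init_type not in VALID_INIT_TYPES:
        raise ValueError(f"unknown init type: {init_type}")
    return ["fairseq-train", data_dir, "--init-type", init_type, *extra_args]

def stage_environment(init_type):
    """Return extra environment variables for a stage."""
    # Assumption: limit the profiling stage to a single visible GPU.
    if init_type == "adaptive-profiling":
        return {"CUDA_VISIBLE_DEVICES": "0"}
    return {}

data_dir = "data-bin/my-dataset"  # hypothetical path to preprocessed data
for stage in VALID_INIT_TYPES:
    env_prefix = " ".join(f"{k}={v}" for k, v in stage_environment(stage).items())
    command = shlex.join(build_stage_command(data_dir, stage))
    print(f"{env_prefix} {command}".strip())
```

The printed lines are templates to extend with the remaining training flags; they are not complete commands for reproducing the reported results.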

Citation

Please cite the following paper if you find our model useful. Thanks!

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han (2020). Understanding the Difficulty of Training Transformers. arXiv preprint arXiv:2004.08249.

@article{liu2020admin,
  title={Understanding the Difficulty of Training Transformers},
  author={Liu, Liyuan and Liu, Xiaodong and Gao, Jianfeng and Chen, Weizhu and Han, Jiawei},
  journal={arXiv preprint arXiv:2004.08249},
  year={2020}
}
