Add Chapter 3.4 Trafo param count
MikeySaw committed Jun 20, 2024
1 parent c969eac commit f319bc6
Showing 3 changed files with 35 additions and 6 deletions.
29 changes: 29 additions & 0 deletions content/chapters/03_transformer/03_04_trafo-params.md
@@ -0,0 +1,29 @@
---
title: "Chapter 03.04: Transformer Parameter Count"
weight: 3004
---
This chapter deals with the number of parameters of the transformer. The parameter count of a transformer model is the total number of learnable parameters in its architecture, distributed across the model's various components.
These components typically include:

1. **Embedding Layers**: Parameters associated with the input and output embeddings for tokens, which encode their semantic meanings.
2. **Encoder Layers**: Parameters within each encoder layer, including those associated with self-attention mechanisms, position-wise feedforward networks, and layer normalization.
3. **Decoder Layers**: Parameters within each decoder layer, including self-attention mechanisms, cross-attention mechanisms, position-wise feedforward networks, and layer normalization.
4. **Positional Encodings**: Parameters used to encode positional information in the input sequences; only learned positional embeddings contribute parameters, whereas the fixed sinusoidal encodings of the original transformer add none.

The total parameter count of a transformer model is the sum of parameters from all these components, with variations depending on the specific architecture and hyperparameters chosen for the model.
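As a rough back-of-the-envelope illustration (a sketch under simplifying assumptions, not the derivation from the slides), the snippet below adds up these contributions for a vanilla encoder-decoder transformer. The function name, the hyperparameter defaults, and the choices of learned positional embeddings and tied input/output embeddings are illustrative assumptions, not values fixed by this chapter.

```python
def transformer_param_count(vocab_size=32000, d_model=512, d_ff=2048,
                            n_enc=6, n_dec=6, max_len=512,
                            learned_pos=True, tied_embeddings=True):
    """Rough parameter count for a vanilla encoder-decoder transformer (illustrative)."""
    # 1. Embedding layers: token embeddings (+ a separate output projection if untied)
    emb = vocab_size * d_model
    if not tied_embeddings:
        emb += d_model * vocab_size

    # 4. Positional encodings: learned embeddings add parameters, sinusoidal ones add none
    pos = max_len * d_model if learned_pos else 0

    # One multi-head attention block: W_Q, W_K, W_V, W_O (each d_model x d_model) plus biases.
    # The number of heads does not change the total, since the heads split d_model.
    attn = 4 * (d_model * d_model + d_model)

    # Position-wise feed-forward network: d_model -> d_ff -> d_model, with biases
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model

    # Layer normalization: one scale and one shift vector of size d_model
    ln = 2 * d_model

    # 2. Encoder layer: self-attention + FFN, each followed by layer norm
    enc_layer = attn + ffn + 2 * ln
    # 3. Decoder layer: self-attention + cross-attention + FFN, each followed by layer norm
    dec_layer = 2 * attn + ffn + 3 * ln

    return emb + pos + n_enc * enc_layer + n_dec * dec_layer


# Prints 60,784,640 for these illustrative defaults, the same order of magnitude
# as the ~65M reported for the base model in "Attention Is All You Need".
print(f"{transformer_param_count():,}")
```
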


<!--more-->

<!--
### Lecture video
{{< video id="TfrSKiOecWI" >}}
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/chapter11/slides/chapter03-transformer/slides-34-trafo-params.pdf" >}}

### Additional Resources

- [Blog about the Transformer Parameter Count](https://towardsdatascience.com/how-to-estimate-the-number-of-parameters-in-transformer-models-ca0f57d8dff0)

@@ -1,6 +1,6 @@
---
title: "Chapter 03.04: Long Sequences: Transformer-XL"
weight: 3004
title: "Chapter 03.05: Long Sequences: Transformer-XL"
weight: 3005
---
This chapter is about the Transformer-XL [1] and how it deals with the issue of long sequences. Transformer-XL is an extension of the original Transformer architecture designed to address the limitations of long-range dependency modeling in sequence-to-sequence tasks. It aims to solve the problem of capturing and retaining information over long sequences by introducing a segment-level recurrence mechanism, enabling the model to process sequences of arbitrary length without being constrained by fixed-length contexts or running into computational limitations. Additionally, Transformer-XL incorporates relative positional embeddings to better capture positional information across segments of varying lengths.
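To make the recurrence idea concrete, here is a minimal toy sketch (an assumption-laden illustration, not the Transformer-XL reference implementation): hidden states from previous segments are cached as a memory, concatenated to the current segment as additional attendable context, and detached so that no gradient crosses segment boundaries. The single attention step, all shapes, and the relative positional embeddings being omitted are simplifications of this example.

```python
import torch

def segment_level_recurrence(segments, d_model=64, mem_len=128):
    """Toy illustration of Transformer-XL-style segment-level memory handling.

    `segments` is a list of (seg_len, d_model) tensors. The single attention
    step below stands in for a full layer; the point is only how cached hidden
    states extend the context of the current segment without gradient flow.
    (Relative positional embeddings are omitted for brevity.)
    """
    memory = torch.zeros(0, d_model)                 # empty memory before the first segment
    outputs = []
    for seg in segments:
        context = torch.cat([memory, seg], dim=0)    # extended context = cached states + current segment
        scores = seg @ context.T / d_model ** 0.5    # current positions attend over the extended context
        hidden = torch.softmax(scores, dim=-1) @ context
        outputs.append(hidden)
        # Keep the most recent hidden states as memory for the next segment, detached from the graph.
        memory = torch.cat([memory, hidden], dim=0)[-mem_len:].detach()
    return outputs


segs = [torch.randn(32, 64) for _ in range(4)]
print([h.shape for h in segment_level_recurrence(segs)])
```
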

@@ -12,7 +12,7 @@ This chapter is about the Transformer-XL [1] and how it deals with the issue of
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-34-trafo-xl.pdf" >}}
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-trafo-xl.pdf" >}}

### References

@@ -1,6 +1,6 @@
---
title: "Chapter 03.05: Efficient Transformers"
weight: 3005
title: "Chapter 03.06: Efficient Transformers"
weight: 3006
---
Efficient Transformers are designed to mitigate the computational and memory requirements of standard transformer architectures, particularly when dealing with large-scale datasets or resource-constrained environments. They aim to address issues such as scalability and efficiency in training and inference. One approach used in efficient transformers is replacing the standard self-attention mechanism with more lightweight attention mechanisms, which reduce the computational complexity of attending to long sequences by approximating the attention mechanism with lower-rank matrices or restricting attention to local or sparse regions of the sequence. These approaches enable transformers to be more practical for real-world applications where computational resources are limited.
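As one concrete example of the sparse-attention idea (a simplified sketch in the spirit of local windowed attention, not any specific model from the slides), the loop below lets each position attend only to a fixed-size window of preceding positions, cutting the cost from quadratic to linear in the sequence length. The function name, shapes, and the loop-based formulation are illustrative choices.

```python
import torch

def local_window_attention(q, k, v, window=32):
    """Naive sliding-window attention: each position attends only to itself and
    the `window - 1` preceding positions instead of the full sequence, reducing
    the cost from O(n^2) to O(n * window).

    q, k, v: tensors of shape (seq_len, d). The explicit loop is only meant to
    make the restricted attention pattern visible, not to be fast.
    """
    seq_len, d = q.shape
    out = torch.empty_like(q)
    for t in range(seq_len):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / d ** 0.5       # scores over the local window only
        out[t] = torch.softmax(scores, dim=-1) @ v[lo:t + 1]
    return out


q = k = v = torch.randn(256, 64)
print(local_window_attention(q, k, v).shape)   # torch.Size([256, 64])
```
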

@@ -12,7 +12,7 @@ Efficient Transformers are designed to mitigate the computational and memory req
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-efficient.pdf" >}}
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-36-efficient.pdf" >}}

### Additional Resources

