Add Chapter 3.4 Trafo param count
MikeySaw committed Jun 20, 2024
1 parent c969eac commit f319bc6
Showing 3 changed files with 35 additions and 6 deletions.
29 changes: 29 additions & 0 deletions content/chapters/03_transformer/03_04_trafo-params.md
@@ -0,0 +1,29 @@
---
title: "Chapter 03.04: Transformer Parameter Count"
weight: 3004
---
This chapter deals with the number of parameters of the transformer. The parameter count of a transformer model is the total number of learnable parameters in its architecture, distributed across the model's various components.
These components typically include:

1. **Embedding Layers**: Parameters associated with the input and output embeddings for tokens, which encode their semantic meanings.
2. **Encoder Layers**: Parameters within each encoder layer, including those associated with self-attention mechanisms, position-wise feedforward networks, and layer normalization.
3. **Decoder Layers**: Parameters within each decoder layer, including self-attention mechanisms, cross-attention mechanisms, position-wise feedforward networks, and layer normalization.
4. **Positional Encodings**: Parameters used to encode positional information in the input sequences; only learned positional embeddings contribute parameters, whereas the fixed sinusoidal encodings of the original transformer add none.

The total parameter count of a transformer model is the sum of parameters from all these components, with variations depending on the specific architecture and hyperparameters chosen for the model.
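As a rough back-of-the-envelope illustration (a sketch under simplifying assumptions, not the derivation from the slides), the snippet below adds up these contributions for a vanilla encoder-decoder transformer. The function name, the hyperparameter defaults, and the choices of learned positional embeddings and tied input/output embeddings are illustrative assumptions, not values fixed by this chapter.

```python
def transformer_param_count(vocab_size=32000, d_model=512, d_ff=2048,
                            n_enc=6, n_dec=6, max_len=512,
                            learned_pos=True, tied_embeddings=True):
    """Rough parameter count for a vanilla encoder-decoder transformer (illustrative)."""
    # 1. Embedding layers: token embeddings (+ a separate output projection if untied)
    emb = vocab_size * d_model
    if not tied_embeddings:
        emb += d_model * vocab_size

    # 4. Positional encodings: learned embeddings add parameters, sinusoidal ones add none
    pos = max_len * d_model if learned_pos else 0

    # One multi-head attention block: W_Q, W_K, W_V, W_O (each d_model x d_model) plus biases.
    # The number of heads does not change the total, since the heads split d_model.
    attn = 4 * (d_model * d_model + d_model)

    # Position-wise feed-forward network: d_model -> d_ff -> d_model, with biases
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model

    # Layer normalization: one scale and one shift vector of size d_model
    ln = 2 * d_model

    # 2. Encoder layer: self-attention + FFN, each followed by layer norm
    enc_layer = attn + ffn + 2 * ln
    # 3. Decoder layer: self-attention + cross-attention + FFN, each followed by layer norm
    dec_layer = 2 * attn + ffn + 3 * ln

    return emb + pos + n_enc * enc_layer + n_dec * dec_layer


# Prints 60,784,640 for these illustrative defaults, the same order of magnitude
# as the ~65M reported for the base model in "Attention Is All You Need".
print(f"{transformer_param_count():,}")
```
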


<!--more-->

<!--
### Lecture video
{{< video id="TfrSKiOecWI" >}}
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/chapter11/slides/chapter03-transformer/slides-34-trafo-params.pdf" >}}

### Additional Resources

- [Blog about the Transformer Parameter Count](https://towardsdatascience.com/how-to-estimate-the-number-of-parameters-in-transformer-models-ca0f57d8dff0)

@@ -1,6 +1,6 @@
---
title: "Chapter 03.04: Long Sequences: Transformer-XL"
weight: 3004
title: "Chapter 03.05: Long Sequences: Transformer-XL"
weight: 3005
---
This chapter is about the Transformer-XL [1] and how it deals with the issue of long sequences. Transformer-XL is an extension of the original Transformer architecture designed to address the limitations of long-range dependency modeling in sequence-to-sequence tasks. It aims to solve the problem of capturing and retaining information over long sequences by introducing a segment-level recurrence mechanism, enabling the model to process sequences of arbitrary length without being constrained by fixed-length contexts or running into computational limitations. Additionally, Transformer-XL incorporates relative positional embeddings to better capture positional information across segments of varying lengths.
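To make the recurrence idea concrete, here is a minimal toy sketch (an assumption-laden illustration, not the Transformer-XL reference implementation): hidden states from previous segments are cached as a memory, concatenated to the current segment as additional attendable context, and detached so that no gradient crosses segment boundaries. The single attention step, all shapes, and the relative positional embeddings being omitted are simplifications of this example.

```python
import torch

def segment_level_recurrence(segments, d_model=64, mem_len=128):
    """Toy illustration of Transformer-XL-style segment-level memory handling.

    `segments` is a list of (seg_len, d_model) tensors. The single attention
    step below stands in for a full layer; the point is only how cached hidden
    states extend the context of the current segment without gradient flow.
    (Relative positional embeddings are omitted for brevity.)
    """
    memory = torch.zeros(0, d_model)                 # empty memory before the first segment
    outputs = []
    for seg in segments:
        context = torch.cat([memory, seg], dim=0)    # extended context = cached states + current segment
        scores = seg @ context.T / d_model ** 0.5    # current positions attend over the extended context
        hidden = torch.softmax(scores, dim=-1) @ context
        outputs.append(hidden)
        # Keep the most recent hidden states as memory for the next segment, detached from the graph.
        memory = torch.cat([memory, hidden], dim=0)[-mem_len:].detach()
    return outputs


segs = [torch.randn(32, 64) for _ in range(4)]
print([h.shape for h in segment_level_recurrence(segs)])
```
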

@@ -12,7 +12,7 @@ This chapter is about the Transformer-XL [1] and how it deals with the issue of
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-34-trafo-xl.pdf" >}}
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-trafo-xl.pdf" >}}

### References

@@ -1,6 +1,6 @@
---
title: "Chapter 03.05: Efficient Transformers"
weight: 3005
title: "Chapter 03.06: Efficient Transformers"
weight: 3006
---
Efficient Transformers are designed to mitigate the computational and memory requirements of standard transformer architectures, particularly when dealing with large-scale datasets or resource-constrained environments. They aim to address issues such as scalability and efficiency in training and inference. One approach used in efficient transformers is replacing the standard self-attention mechanism with more lightweight attention mechanisms, which reduce the computational complexity of attending to long sequences by approximating the attention mechanism with lower-rank matrices or restricting attention to local or sparse regions of the sequence. These approaches enable transformers to be more practical for real-world applications where computational resources are limited.
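As one concrete example of the sparse-attention idea (a simplified sketch in the spirit of local windowed attention, not any specific model from the slides), the loop below lets each position attend only to a fixed-size window of preceding positions, cutting the cost from quadratic to linear in the sequence length. The function name, shapes, and the loop-based formulation are illustrative choices.

```python
import torch

def local_window_attention(q, k, v, window=32):
    """Naive sliding-window attention: each position attends only to itself and
    the `window - 1` preceding positions instead of the full sequence, reducing
    the cost from O(n^2) to O(n * window).

    q, k, v: tensors of shape (seq_len, d). The explicit loop is only meant to
    make the restricted attention pattern visible, not to be fast.
    """
    seq_len, d = q.shape
    out = torch.empty_like(q)
    for t in range(seq_len):
        lo = max(0, t - window + 1)
        scores = q[t] @ k[lo:t + 1].T / d ** 0.5       # scores over the local window only
        out[t] = torch.softmax(scores, dim=-1) @ v[lo:t + 1]
    return out


q = k = v = torch.randn(256, 64)
print(local_window_attention(q, k, v).shape)   # torch.Size([256, 64])
```
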

@@ -12,7 +12,7 @@ Efficient Transformers are designed to mitigate the computational and memory req
-->

### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-efficient.pdf" >}}
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-36-efficient.pdf" >}}

### Additional Resources

