- "*A: In Graph Attention Networks (GAT), a variation of the attention mechanism is used that doesn't explicitly employ separate key, query, and value vectors as in traditional self-attention mechanisms such as those in the Transformer. Instead, GAT computes attention coefficients (attention scores) directly from the node embeddings. Each node's features are first transformed by a shared learnable weight matrix **$W$**; the attention mechanism itself is a single-layer feedforward network, parameterized by a learnable weight vector **$a$**, that scores each edge by applying a LeakyReLU nonlinearity to $a^\top [Wh_i \, \| \, Wh_j]$, i.e., to the concatenation of the transformed embeddings of a node and one of its neighbors (the node attends to itself as well, hence \"self-attention\"). These scores are normalized with a softmax over each node's neighborhood, and the resulting attention coefficients are used to compute a weighted sum of the neighbors' transformed embeddings, which becomes the node's updated representation.\n",
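The mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the original paper's code; the names (`gat_layer`, `leaky_relu`) and the toy graph are assumptions made for the example. It uses the standard decomposition $a^\top[Wh_i \| Wh_j] = a_l^\top Wh_i + a_r^\top Wh_j$ to score all edges at once.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a):
    """Single-head GAT layer sketch.

    h:   (N, F)  input node features
    adj: (N, N)  adjacency matrix with self-loops (1 = edge)
    W:   (F, F') shared linear transform
    a:   (2*F',) attention weight vector, split into [a_l; a_r]
    """
    z = h @ W                                # transformed embeddings, (N, F')
    f = z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]) = LeakyReLU(a_l^T z_i + a_r^T z_j)
    src = z @ a[:f]                          # per-node "source" score, (N,)
    dst = z @ a[f:]                          # per-node "target" score, (N,)
    e = leaky_relu(src[:, None] + dst[None, :])
    # mask non-edges, then softmax over each node's neighborhood
    e = np.where(adj > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # updated representation: attention-weighted sum of neighbors' embeddings
    return alpha @ z

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))                  # 4 nodes, 3 input features
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=float)  # path graph + self-loops
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
out = gat_layer(h, adj, W, a)
print(out.shape)                             # (4, 2)
```

Because every node keeps a self-loop, each row of the masked score matrix has at least one finite entry, so the softmax is always well defined.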