diff --git a/notebooks/multi_modal/labs/image_captioning.ipynb b/notebooks/multi_modal/labs/image_captioning.ipynb index f4200b52..0b26ec89 100644 --- a/notebooks/multi_modal/labs/image_captioning.ipynb +++ b/notebooks/multi_modal/labs/image_captioning.ipynb @@ -13,7 +13,7 @@ "Image captioning models take an image as input and output text. Ideally, we want the model's output to accurately describe the objects and events in the image, similar to a caption a human might provide.
\n", "For example, given an image like the example below, the model is expected to generate a caption such as *\"some people are playing baseball.\"*.\n", "\n", - "
\n", + "
\n", "\n", "In order to generate text, we will build an encoder-decoder model, where the encoder output embedding of an input image, and the decoder output text from the image embedding
\n", "\n", @@ -512,7 +512,6 @@ "* $h_s$ is the sequence of encoder outputs being attended to (the attention \"key\" and \"value\" in transformer terminology).\n", "* $h_t$ is the decoder state attending to the sequence (the attention \"query\" in transformer terminology).\n", "* $c_t$ is the resulting context vector.\n", - "* $a_t$ is the final output combining the \"context\" and \"query\".\n", "\n", "The equations:\n", "\n", diff --git a/notebooks/multi_modal/solutions/image_captioning.ipynb b/notebooks/multi_modal/solutions/image_captioning.ipynb index 3ead05b7..6836088f 100644 --- a/notebooks/multi_modal/solutions/image_captioning.ipynb +++ b/notebooks/multi_modal/solutions/image_captioning.ipynb @@ -668,7 +668,6 @@ "* $h_s$ is the sequence of encoder outputs being attended to (the attention \"key\" and \"value\" in transformer terminology).\n", "* $h_t$ is the decoder state attending to the sequence (the attention \"query\" in transformer terminology).\n", "* $c_t$ is the resulting context vector.\n", - "* $a_t$ is the final output combining the \"context\" and \"query\".\n", "\n", "The equations:\n", "\n",