From c1cbbf32531021fb7a4ff6ad481a555d831d3cca Mon Sep 17 00:00:00 2001 From: Konstantine Tsafatinos Date: Wed, 3 Apr 2024 13:17:32 -0400 Subject: [PATCH] add changes to w3d1 tut2 and clear outputs --- .../W3D1_Tutorial2.ipynb | 457 +++++++++--------- 1 file changed, 236 insertions(+), 221 deletions(-) diff --git a/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb b/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb index fd688a6c7..1ba98918e 100644 --- a/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb +++ b/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb @@ -4,17 +4,17 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "execution": {}, "id": "view-in-github" }, "source": [ - "\"Open   \"Open" + "\"Open" ] }, { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "WlZa8fseqd4K" }, "source": [ "# Tutorial 2: Natural Language Processing and LLMs\n", @@ -23,19 +23,20 @@ "\n", "**By Neuromatch Academy**\n", "\n", - "__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang, Alish Dipani\n", + "__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang\n", "\n", "__Content reviewers:__ Shaonan Wang, Weizhe Yuan, Dalia Nasr, Stephen Kiilu, Alish Dipani, Dora Zhiyu Yang, Adrita Das\n", "\n", "__Content editors:__ Konrad Kording, Shaonan Wang\n", "\n", - "__Production editors:__ Konrad Kording, Spiros Chavlis" + "__Production editors:__ Konrad Kording, Spiros Chavlis, Konstantine Tsafatinos" ] }, { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "PKMqc6acqd4O" }, "source": [ "---\n", @@ -49,7 +50,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "mzuPjmNdqd4O" }, "outputs": [], "source": [ @@ -63,7 +64,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "3rTu9QpMqd4P" }, "source": [ "---\n", @@ -75,7 +77,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "5MCmd7R7qd4Q" }, "outputs": [], "source": [ @@ -95,7 +97,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "dHhN8BRkqd4Q" }, "outputs": [], "source": [ @@ -124,7 +126,7 @@ "execution_count": null, "metadata": { "cellView": "both", - "execution": {} + "id": "n7b2dC6Fqd4Q" }, "outputs": [], "source": [ @@ -148,7 +150,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "OznG0qY9qd4R" }, "outputs": [], "source": [ @@ -180,7 +182,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "eePtSs_sqd4R" }, "outputs": [], "source": [ @@ -213,7 +215,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "JsAo6X4Iqd4S" }, "outputs": [], "source": [ @@ -225,7 +227,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "flHBFu1Yqd4S" }, "source": [ "---\n", @@ -240,7 +243,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "RyRAy6cvqd4S" }, "outputs": [], "source": [ @@ -294,7 +297,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "P48-d7DBqd4S" }, "source": [ "A core principle of Natural Language Processing is embedding words as vectors. 
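For a concrete feel of what that buys us, here is a minimal sketch — it assumes the optional `gensim` package and its downloadable `glove-wiki-gigaword-50` vectors, neither of which is part of this tutorial's setup:

```python
# A sketch: query a small set of pre-trained GloVe word vectors.
# Assumes `pip install gensim`; the first call downloads ~65 MB of vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional word vectors

# Neighbors of "king" should be semantically related words;
# the exact list depends on the training corpus.
print(glove.most_similar("king", topn=5))
```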
In the relevant vector space, words with similar meanings are close to one another.\n", @@ -309,7 +313,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "yHW7FIWbqd4T" }, "outputs": [], "source": [ @@ -320,7 +324,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "JL1gaT_1qd4T" }, "source": [ "---\n", @@ -334,7 +339,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "4UpUD8F9qd4T" }, "outputs": [], "source": [ @@ -390,7 +395,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "5P14Tf9nqd4T" }, "outputs": [], "source": [ @@ -401,7 +406,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "ck80cJ0Iqd4T" }, "source": [ "## Tokenizers\n", @@ -418,7 +424,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "tYKiSMPpqd4T" }, "source": [ "## Generating a dataset\n", @@ -432,7 +439,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "js5TYMPGqd4T" }, "outputs": [], "source": [ @@ -445,7 +452,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "Ks7zztuMqd4U" }, "outputs": [], "source": [ @@ -456,7 +463,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "XVM-q3Gyqd4U" }, "outputs": [], "source": [ @@ -471,7 +478,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "8Gfq2xOxqd4U" }, "source": [ "Now we will create the actual `Tokenizer`, adhering to the [`hf.Tokenizer` protocol](https://huggingface.co/docs/transformers/main_classes/tokenizer). (Adhering to a standard protocol enables us to swap in our tokenizer for any tokenizer in the huggingface ecosystem or to apply our own tokenizer to any model in the huggingface ecosystem.)\n", @@ -490,10 +498,11 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "ykV5SJFKqd4U" }, "outputs": [], "source": [ + "# Try playing with these hyperparameters!\n", "VOCAB_SIZE = 12_000" ] }, @@ -501,7 +510,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "m46DTJh4qd4U" }, "outputs": [], "source": [ @@ -517,7 +526,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "cXlqpIIwqd4U" }, "source": [ "## Tokenizer Features\n", @@ -529,7 +539,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "sUMua_-oqd4U" }, "outputs": [], "source": [ @@ -548,7 +558,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "Xgz81D5Oqd4U" }, "source": [ "Next, we'll add a pre-tokenizer. The pre-tokenizer is applied to the text after normalizing it but before it's tokenized. The pre-tokenizer is useful for splitting text into chunks, which are easier to tokenize. For example, we can split text into chunks separated by punctuation or whitespace." @@ -558,7 +569,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "szO-uA1Iqd4V" }, "outputs": [], "source": [ @@ -569,17 +580,11 @@ "])" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note:** In practice, it is not necessary to use pre-tokenizers, but we use it for demonstration purposes. For instance, \"2-3\" is not the same as \"23\", so removing punctuation or splitting up digits or punctuation is a bad idea! 
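You can inspect exactly what the pre-tokenizer does to such a string — a quick sketch mirroring the `Whitespace`/`Digits`/`Punctuation` sequence attached above (the output shown in the comment is illustrative):

```python
# A sketch: preview how the pre-tokenizer splits a string
# before the tokenizer itself ever sees it.
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.Punctuation(),
])
print(pre.pre_tokenize_str("2-3 apples"))
# e.g. [('2', (0, 1)), ('-', (1, 2)), ('3', (2, 3)), ('apples', (4, 10))]
```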
Moreover, the current tokenizer is powerful enough to deal with punctuation." ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "7qyA2uyEqd4V" }, "source": [ "Finally, we'll train the tokenizer with our dataset. After all, we want a tokenizer that works well on this dataset. There are a few different algorithms for training tokenizers. Here are two common ones:\n", "\n", "* **WordPiece**, which merges tokens based upon which merge maximizes the likelihood of the training data\n", "* **Byte-Pair Encoding (BPE)**, which merges tokens based upon how frequently the pair occurs together\n", "\n", "We'll use WordPiece here." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kuRtXEHsqd4V" }, "outputs": [], "source": [ "trainer = tokenizers.trainers.WordPieceTrainer(\n", " vocab_size=VOCAB_SIZE,\n", " special_tokens=[\"[UNK]\", \"[CLS]\", \"[SEP]\", \"[PAD]\", \"[MASK]\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Special Tokens\n", "\n", "Tokenizers often have special tokens representing certain concepts such as:\n", "* [PAD]: Added to the end of shorter input sequences to ensure equal input length for the whole batch\n", "* [START]: Start of the sequence\n", "* [END]: End of the sequence\n", "* [UNK]: Unknown characters not present in the vocabulary\n", "* [BOS]: Beginning of sentence\n", "* [EOS]: End of sentence\n", "* [SEP]: Separation between two sentences in a sequence\n", "* [CLS]: Token used for classification tasks to represent the whole sequence\n", "* [MASK]: Used in pre-training phase for masked language modeling tasks in models like BERT" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "2_vheCz0qd4V" }, "source": [ "Those special tokens are important because they tell the WordPiece training process how to treat phrases, masks, and unknown tokens.\n", "\n", "Here's how we train the tokenizer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qXuF9c1Jqd4V" }, "outputs": [], "source": [ "tokenizer.train_from_iterator(get_texts(), trainer=trainer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x5wgVldUqd4V" }, "outputs": [], "source": [ "tokenizer.get_vocab_size()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vfcYUR_nqd4V" }, "outputs": [], "source": [ "encoded = tokenizer.encode(\"def add_numbers(a, b):\\n return a + b\")\n", "print(encoded.tokens)" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "8AKDhuLhqd4W" }, "source": [ "(The `##` means that the token is a continuation of the previous chunk.)\n", "\n", "Let's look at the vocabulary of the tokenizer. We'll print the first few tokens, sorted by their index:" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "qVrsC42Wqd4W" }, "source": [ "### Think 2.1! Is it a good idea to use pre-tokenizers?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rNVo80utqd4W" }, "outputs": [], "source": [ "# to_remove explanation\n", "\n", "\"\"\"\n", "Probably not. For instance, \"2-3\" is not the same as \"23\", so removing\n", "punctuation or splitting up digits or punctuation is a bad idea!\n", "Moreover, the current tokenizer is powerful enough to deal with punctuation.\n", "\"\"\";" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "GNML3IQbqd4W" }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Is_it_a_good_idea_to_do_pre_tokenizers_Discussion\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "g2PiW9HSqd4b" }, "source": [ "### Think 2.2! Tokenizer good practices\n", "\n", "We established that the tokenizer is a better move than the One-Hot-Encoder because it can handle out-of-vocabulary words. 
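To see that in action, here is a sketch using the WordPiece tokenizer trained above (your exact subword splits will vary from run to run):

```python
# A sketch: an out-of-vocabulary word is decomposed into known subword
# pieces instead of collapsing to a single [UNK] token.
tokens = tokenizer.encode("hyperparameterization")
print(tokens.tokens)  # e.g. ['hyper', '##para', '##meter', '##ization']
```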
But what if we just made a one-hot encoding where the vocabulary is all possible two-character combinations? Would there still be an advantage to the tokenizer?\n", "\n", "Hint: Think about out-of-vocabulary words, and how the two approaches handle them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x-2fXkFfqd4b" }, "outputs": [], "source": [ "# to_remove explanation\n", "\n", "\"\"\"\n", "Yes: the tokenizer would still have an advantage. A two-character one-hot\n", "encoding can represent any string, but it treats every pair of characters as\n", "equally different from every other pair, so the model has to relearn word\n", "structure from scratch. A trained tokenizer instead merges frequent chunks,\n", "so common words and subwords get single, reusable vocabulary entries.\n", "\"\"\";" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "3rMQyk4jqd4b" }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Tokenizer_good_practices_Discussion\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "fIlA57Ugqd4b" }, "source": [ "### Think 2.3: Chinese and English tokenizer\n", "\n", "Let's think about a language like Chinese, where each word is composed of relatively few characters compared to English (`hungry` is six unicode characters, but `饿` is one unicode character), but there are many more unique Chinese characters than there are letters in the English alphabet.\n", "\n", "How do you think this would affect the tokenizer? What would be the advantages and disadvantages of a Chinese tokenizer compared to an English tokenizer?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J_zmktEfqd4b" }, "outputs": [], "source": [ "# to_remove explanation\n", "\n", "\"\"\"\n", "A Chinese tokenizer would need a much larger base vocabulary of characters,\n", "but each character carries more meaning, so sequences are shorter.\n", "An English tokenizer has a small alphabet but needs longer token sequences.\n", "\"\"\";" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "cVLIc1o-qd4c" }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Chinese_and_English_tokenizer_Discussion\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "0rhXwvHZqd4c" }, "source": [ "---\n", "# Section 3: Natural Language Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "OVh95iK7qd4c" }, "outputs": [], "source": [ "# @title Video 3: Natural Language Generation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "0KDWCEScqd4c" }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Natural_Language_Generation_Video\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "m2w0XGZLqd4c" }, "source": [ "# Section 4: NLG with GPT\n", "\n", "Let's put some of these ideas into practice. We'll use a pre-trained language model to generate text, and then we'll fine-tune it on a new dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "k4wAf_b5qd4c" }, "outputs": [], "source": [ "# @title Video 4: NLG with GPT\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "uRKXa6YGqd4d" }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_NLG_with_GPT_Video\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "tfi-LrTUqd4d" }, "source": [ "## Using state-of-the-art (SOTA) Models\n", "\n", "Instead of training a model from scratch, we can download a pre-trained model and use it right away, or fine-tune it for our own task." ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "qoCoTz5oqd4d" }, "source": [ "We're first going to pick a tokenizer. You can see some of the options [here](https://huggingface.co/transformers/pretrained_models.html). We'll use the CodeParrot tokenizer, which is a BPE tokenizer. But you can choose (or build!) another if you'd like to try offroading!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gtqs7Omuqd4d" }, "outputs": [], "source": [ "tokenizer = AutoTokenizer.from_pretrained(\"codeparrot/codeparrot-small\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QXHCuUnQqd4d" }, "outputs": [], "source": [ "tokenizer(\"def hello_world():\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {}, "id": "_dbK4t1uqd4d" }, "source": [ "### Think 4.1! 
Tokenizers\n", @@ -1059,7 +1095,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "6ZRSEmA8qd4e" }, "outputs": [], "source": [ @@ -1078,7 +1114,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "JYrMg9zpqd4e" }, "outputs": [], "source": [ @@ -1089,7 +1125,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "1LmiXIPjqd4e" }, "source": [ "Next, we'll download a pre-built model architecture. CodeParrot (the model) is a GPT-2 model, which is a transformer-based language model. You can see some of the options [here](https://huggingface.co/transformers/pretrained_models.html). But you can choose (or build!) another!\n", @@ -1103,7 +1140,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "qGLCKY-Rqd4e" }, "outputs": [], "source": [ @@ -1122,7 +1159,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "8TFP5ZFYqd4e" }, "outputs": [], "source": [ @@ -1139,7 +1176,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "hWuzPm5Oqd4e" }, "source": [ "Recall that these tokens are integer indices in the vocabulary of the tokenizer. We can use the tokenizer to decode these tokens into a string, which we can print out to see what the model generates." @@ -1149,7 +1187,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "LLEMRD2dqd4e" }, "outputs": [], "source": [ @@ -1162,7 +1200,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "HYHBlY8mqd4f" }, "source": [ "**(Quick knowledge-check: what are the weirdly-rendering characters representing?)**\n", @@ -1178,7 +1217,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "grCI-0Ofqd4f" }, "outputs": [], "source": [ @@ -1189,7 +1228,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "NuVA7k75qd4f" }, "outputs": [], "source": [ @@ -1199,7 +1238,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "3NEB_D8Bqd4f" }, "source": [ "Let's see if we can fool our model now! The huggingface documentation tells us that the codeparrot model was trained to generate Python code ([docs](https://huggingface.co/codeparrot/codeparrot-small)). Let's see if we can get it to generate some JavaScript." @@ -1209,7 +1249,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "Rjr9ft40qd4f" }, "outputs": [], "source": [ @@ -1221,7 +1261,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "AMzqr5inqd4f" }, "source": [ "Yikes! I don't know what it generated for you, but what it made for me was:\n", @@ -1249,7 +1290,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "pvixlG4Pqd4f" }, "source": [ "### Think 4.2! 
Using SOTA models\n", @@ -1265,7 +1307,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "ghT48wIAqd4f" }, "outputs": [], "source": [ @@ -1290,7 +1332,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "CoUKAmj4qd4g" }, "outputs": [], "source": [ @@ -1301,7 +1343,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "AhM34Rnfqd4g" }, "source": [ "## Fine-Tuning\n", @@ -1317,7 +1360,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "TWyn0q5kqd4g" }, "outputs": [], "source": [ @@ -1336,7 +1379,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "AGFcGqlHqd4g" }, "source": [ "Like training any model, we need to define a training loop and an evaluation metric.\n", @@ -1347,10 +1391,11 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "BIfwDJ8Lqd4g" }, "source": [ - "### Implement the code to fine-tune the model\n", + "### Coding Exercise 4.1: Implement the code to fine-tune the model\n", "\n", "Here are the big pieces of what we do below:\n", "\n", @@ -1365,7 +1410,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "13iHRAMDqd4g" }, "outputs": [], "source": [ @@ -1401,136 +1446,67 @@ " tokenizer=tokenizer, mlm=False,\n", ")\n", "\n", - "# Trainer:\n", - "trainer = Trainer(\n", - " model=model,\n", - " args=training_args,\n", - " train_dataset=encoded_dataset,\n", - " tokenizer=tokenizer,\n", - " compute_metrics=compute_metrics,\n", - " data_collator=data_collator,\n", - ")" + "trainer = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "g0pVRl6mqd4g" }, "outputs": [], "source": [ - "# Run the actual training:\n", - "trainer.train()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Coding Exercise 4.1: Implement the code to generate text after fine-tuning.\n", - "\n", - "To generate text, we provide input tokens to the model, let it generate the next token and append it into the input tokens. Now, keep repeating this process until you reach the desired output length." + "trainer = ..." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "id": "Rv3dPA5Zqd4h" + }, "outputs": [], "source": [ - "# Number of tokens to generate\n", - "num_tokens = 100\n", - "\n", - "# Move the model to the CPU for inference\n", - "model.to(\"cpu\")\n", - "\n", - "# Print input prompt\n", - "print(f'Input prompt: \\n{input_prompt}')\n", - "\n", - "# Encode the input prompt\n", - "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n", - "input_tokens = ...\n", - "\n", - "# Turn off storing gradients\n", - "with torch.no_grad():\n", - " # Keep iterating until num_tokens are generated\n", - " for tkn_idx in tqdm(range(num_tokens)):\n", - " # Forward pass through the model\n", - " # The model expects the tensor to be of Long or Int dtype\n", - " output = ...\n", - " # Get output logits\n", - " logits = output.logits[-1, :]\n", - " # Convert into probabilities\n", - " probs = nn.functional.softmax(logits, dim=-1)\n", - " # Get the index of top token\n", - " top_token = ...\n", - " # Append the token into the input sequence\n", - " input_tokens.append(top_token)\n", + "# to_remove solution\n", "\n", - "# Decode and print the generated text\n", - "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n", - "decoded_text = ...\n", - "print(f'Generated text: \\n{decoded_text}')" + "trainer = Trainer(\n", + " model=model,\n", + " args=training_args,\n", + " train_dataset=encoded_dataset,\n", + " tokenizer=tokenizer,\n", + " compute_metrics=compute_metrics,\n", + " data_collator=data_collator,\n", + ")" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "id": "3KNNTF7dqd4h" + }, "outputs": [], "source": [ - "# to_remove solution\n", - "\n", - "\n", - "# Number of tokens to generate\n", - "num_tokens = 100\n", - "\n", - "# Move the model to the CPU for inference\n", - "model.to(\"cpu\")\n", - "\n", - "# Print input prompt\n", - "print(f'Input prompt: \\n{input_prompt}')\n", - "\n", - "# Encode the input prompt\n", - "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n", - "input_tokens = tokenizer.encode(input_prompt)\n", - "\n", - "# Turn off storing gradients\n", - "with torch.no_grad():\n", - " # Keep iterating until num_tokens are generated\n", - " for tkn_idx in tqdm(range(num_tokens)):\n", - " # Forward pass through the model\n", - " output = model(torch.IntTensor(input_tokens))\n", - " # Get output logits\n", - " logits = output.logits[-1, :]\n", - " # Convert into probabilities\n", - " probs = nn.functional.softmax(logits, dim=-1)\n", - " # Get the index of top token\n", - " top_token = torch.argmax(probs).item()\n", - " # Append the token into the input sequence\n", - " input_tokens.append(top_token)\n", - "\n", - "# Decode and print the generated text\n", - "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n", - "decoded_text = tokenizer.decode(input_tokens)\n", - "print(f'Generated text: \\n{decoded_text}')" + "# Run the actual training:\n", + "trainer.train()" ] }, { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "Dd4QRunMqd4h" }, "source": [ - "We can also directly generate text using the generation_pipeline:" + "Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "GjebpogEqd4h" }, "outputs": [], "source": [ @@ -1546,7 +1522,8 @@ { "cell_type": "markdown", "metadata": { - 
"execution": {} + "execution": {}, + "id": "G6H_vRsJqd4h" }, "source": [ "Of course, your results will be slightly different. Here's what I got:\n", @@ -1569,7 +1546,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "XxDwG3XDqd4h" }, "outputs": [], "source": [ @@ -1580,7 +1557,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "xbuKlWAnqd4h" }, "source": [ "### Think 4.3! Accuracy metric observations\n", @@ -1594,7 +1572,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "execution": {} + "id": "NzYAvAx3qd4h" }, "outputs": [], "source": [ @@ -1614,7 +1592,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "bPAM1aKIqd4i" }, "outputs": [], "source": [ @@ -1625,7 +1603,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "AKNGC7ymqd4i" }, "source": [ "---\n", @@ -1639,7 +1618,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "z63HA-Hoqd4i" }, "outputs": [], "source": [ @@ -1695,7 +1674,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "pP2EGNt9qd4i" }, "outputs": [], "source": [ @@ -1706,16 +1685,50 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "9MqFmxIhqd4i" }, "source": [ "## Play around with LLMs\n", "\n", - "Try the following questions with [ChatGPT](https://openai.com/blog/chatgpt) (GPT3.5 without access to the web) and with GPTBing in creative mode (GPT4 with access to the web). Note that the latter requires installing Microsoft Edge.\n", + "1. Try using LLMs' API to do tasks, such as utilizing the GPT-2 API to extend text from a provided context. To achieve this, ensure you have a HuggingFace account and secure an API token.\n", "\n", - "Pick someone you know who is likely to have a web presence but is not super famous (not Musk or Trump). Ask GPT for a two-paragraph biography. How good is it?\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cB-5TRfTs7TK" + }, + "outputs": [], + "source": [ + "import requests\n", + "\n", + " def query(payload, model_id, api_token):\n", + " headers = {\"Authorization\": f\"Bearer {api_token}\"}\n", + " API_URL = f\"https://api-inference.huggingface.co/models/{model_id}\"\n", + " response = requests.post(API_URL, headers=headers, json=payload)\n", + " return response.json()\n", + "\n", + " model_id = \"gpt2\"\n", + " api_token = \"hf_****\" # get yours at hf.co/settings/tokens\n", + " data = query(\"The goal of life is\", model_id, api_token)\n", + " print(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lOPt6NMls-oQ" + }, + "source": [ + "2. Try the following questions with [ChatGPT](https://openai.com/blog/chatgpt) (GPT3.5 without access to the web) and with GPTBing in creative mode (GPT4 with access to the web). Note that the latter requires installing Microsoft Edge.\n", + "\n", + " Pick someone you know who is likely to have a web presence but is not super famous (not Musk or Trump). Ask GPT for a two-paragraph biography. How good is it?\n", "\n", - "Ask it something like “What is the US, UK, Germany, China, and Japan's per capita income over the past ten years? Plot the data in a single figure” (depending on when and where you run this, you will need to paste the resulting Python code into a colab notebook). 
Try asking it questions about the data or the definition of “per capita income” used. How good is it?" + " Ask it something like “What is the US, UK, Germany, China, and Japan's per capita income over the past ten years? Plot the data in a single figure” (depending on when and where you run this, you will need to paste the resulting Python code into a colab notebook). Try asking it questions about the data or the definition of “per capita income” used. How good is it?" ] }, { @@ -1723,7 +1736,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "pRgPoonuqd4i" }, "outputs": [], "source": [ @@ -1734,7 +1747,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "TJ_XV6nFqd4i" }, "source": [ "---\n", @@ -1746,7 +1760,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "lVUA61hKqd4j" }, "source": [ "---\n", @@ -1760,7 +1775,8 @@ { "cell_type": "markdown", "metadata": { - "execution": {} + "execution": {}, + "id": "Fc4DaHLnqd4j" }, "source": [ "---\n", @@ -1774,7 +1790,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "-zdzgpy2qd4j" }, "outputs": [], "source": [ @@ -1830,7 +1846,7 @@ "execution_count": null, "metadata": { "cellView": "form", - "execution": {} + "id": "44UM_kpPqd4j" }, "outputs": [], "source": [ @@ -1842,7 +1858,6 @@ "metadata": { "accelerator": "GPU", "colab": { - "collapsed_sections": [], "gpuType": "T4", "include_colab_link": true, "name": "W3D1_Tutorial2", @@ -1856,7 +1871,7 @@ "name": "python3" }, "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -1870,10 +1885,10 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.3" + "version": "3.11.8" }, "toc-autonumbering": true }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 4 }