diff --git a/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb b/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb
index 42a9d5fe1..fd688a6c7 100644
--- a/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb
+++ b/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/W3D1_Tutorial2.ipynb
@@ -23,7 +23,7 @@
     "\n",
     "**By Neuromatch Academy**\n",
     "\n",
-    "__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang\n",
+    "__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang, Alish Dipani\n",
     "\n",
     "__Content reviewers:__ Shaonan Wang, Weizhe Yuan, Dalia Nasr, Stephen Kiilu, Alish Dipani, Dora Zhiyu Yang, Adrita Das\n",
     "\n",
@@ -494,7 +494,6 @@
    },
    "outputs": [],
    "source": [
-    "# Try playing with these hyperparameters!\n",
     "VOCAB_SIZE = 12_000"
    ]
   },
@@ -570,6 +569,13 @@
     "])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note:** In practice, it is not necessary to use pre-tokenizers, but we use them here for demonstration purposes. For instance, \"2-3\" is not the same as \"23\", so removing punctuation or splitting up digits or punctuation is a bad idea! Moreover, the current tokenizer is powerful enough to deal with punctuation."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -601,6 +607,24 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Special Tokens\n",
+    "\n",
+    "Tokenizers often have special tokens representing certain concepts such as:\n",
+    "* [PAD]: Added to the end of shorter input sequences to ensure equal input length for the whole batch\n",
+    "* [START]: Start of the sequence\n",
+    "* [END]: End of the sequence\n",
+    "* [UNK]: Unknown characters not present in the vocabulary\n",
+    "* [BOS]: Beginning of sentence\n",
+    "* [EOS]: End of sentence\n",
+    "* [SEP]: Separation between two sentences in a sequence\n",
+    "* [CLS]: Token used for classification tasks to represent the whole sequence\n",
+    "* [MASK]: Used in the pre-training phase for masked language modeling tasks in models like BERT"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -687,46 +711,7 @@
     "execution": {}
    },
    "source": [
-    "### Think 2.1! Is it a good idea to do pre_tokenizers?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "execution": {}
-   },
-   "outputs": [],
-   "source": [
-    "# to_remove explanation\n",
-    "\n",
-    "\"\"\"\n",
-    "Probably not. For instance, \"2-3\" is not the same as \"23\", so removing\n",
-    "punctuation or splitting up digits or punctuation is a bad idea!\n",
-    "Moreover, the current tokenizer is powerful enough to deal with punctuation.\n",
-    "\"\"\";"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "cellView": "form",
-    "execution": {}
-   },
-   "outputs": [],
-   "source": [
-    "# @title Submit your feedback\n",
-    "content_review(f\"{feedback_prefix}_Is_it_a_good_idea_to_do_pre_tokenizers_Discussion\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "execution": {}
-   },
-   "source": [
-    "### Think 2.2! Tokenizer good practices\n",
+    "### Think 2.1! Tokenizer good practices\n",
     "\n",
     "We established that the tokenizer is a better move than the One-Hot-Encoder because it can handle out-of-vocabulary words. But what if we just made a one-hot encoding where the vocabulary is all possible two-character combinations? Would there still be an advantage to the tokenizer?\n",
     "\n",
@@ -805,7 +790,7 @@
     "execution": {}
    },
    "source": [
-    "### Think 2.3: Chinese and English tokenizer\n",
+    "### Think 2.2: Chinese and English tokenizer\n",
     "\n",
     "Let's think about a language like Chinese, where words are each composed of a relatively fewer number of characters compared to English (`hungry` is six unicode characters, but `饿` is one unicode character), but there are many more unique Chinese characters than there are letters in the English alphabet.\n",
     "\n",
@@ -1365,7 +1350,7 @@
     "execution": {}
    },
    "source": [
-    "### Coding Exercise 4.1: Implement the code to fine-tune the model\n",
+    "### Implement the code to fine-tune the model\n",
     "\n",
     "Here are the big pieces of what we do below:\n",
     "\n",
@@ -1416,7 +1401,15 @@
     "    tokenizer=tokenizer, mlm=False,\n",
     ")\n",
     "\n",
-    "trainer = ..."
+    "# Trainer:\n",
+    "trainer = Trainer(\n",
+    "    model=model,\n",
+    "    args=training_args,\n",
+    "    train_dataset=encoded_dataset,\n",
+    "    tokenizer=tokenizer,\n",
+    "    compute_metrics=compute_metrics,\n",
+    "    data_collator=data_collator,\n",
+    ")"
    ]
   },
   {
@@ -1427,39 +1420,101 @@
    },
    "outputs": [],
    "source": [
-    "trainer = ..."
+    "# Run the actual training:\n",
+    "trainer.train()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Coding Exercise 4.1: Implement the code to generate text after fine-tuning\n",
+    "\n",
+    "To generate text, we provide input tokens to the model, let it generate the next token, and append it to the input tokens. We keep repeating this process until we reach the desired output length."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "execution": {}
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "# to_remove solution\n",
-    "\n",
-    "trainer = Trainer(\n",
-    "    model=model,\n",
-    "    args=training_args,\n",
-    "    train_dataset=encoded_dataset,\n",
-    "    tokenizer=tokenizer,\n",
-    "    compute_metrics=compute_metrics,\n",
-    "    data_collator=data_collator,\n",
-    ")"
+    "# Number of tokens to generate\n",
+    "num_tokens = 100\n",
+    "\n",
+    "# Move the model to the CPU for inference\n",
+    "model.to(\"cpu\")\n",
+    "\n",
+    "# Print input prompt\n",
+    "print(f'Input prompt: \\n{input_prompt}')\n",
+    "\n",
+    "# Encode the input prompt\n",
+    "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
+    "input_tokens = ...\n",
+    "\n",
+    "# Turn off storing gradients\n",
+    "with torch.no_grad():\n",
+    "    # Keep iterating until num_tokens are generated\n",
+    "    for tkn_idx in tqdm(range(num_tokens)):\n",
+    "        # Forward pass through the model\n",
+    "        # The model expects the tensor to be of Long or Int dtype\n",
+    "        output = ...\n",
+    "        # Get output logits\n",
+    "        logits = output.logits[-1, :]\n",
+    "        # Convert into probabilities\n",
+    "        probs = nn.functional.softmax(logits, dim=-1)\n",
+    "        # Get the index of top token\n",
+    "        top_token = ...\n",
+    "        # Append the token into the input sequence\n",
+    "        input_tokens.append(top_token)\n",
+    "\n",
+    "# Decode and print the generated text\n",
+    "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
+    "decoded_text = ...\n",
+    "print(f'Generated text: \\n{decoded_text}')"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "execution": {}
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "# Run the actual training:\n",
-    "trainer.train()"
+    "# to_remove solution\n",
+    "\n",
+    "\n",
+    "# Number of tokens to generate\n",
+    "num_tokens = 100\n",
+    "\n",
+    "# Move the model to the CPU for inference\n",
+    "model.to(\"cpu\")\n",
+    "\n",
+    "# Print input prompt\n",
+    "print(f'Input prompt: \\n{input_prompt}')\n",
+    "\n",
+    "# Encode the input prompt\n",
+    "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
+    "input_tokens = tokenizer.encode(input_prompt)\n",
+    "\n",
+    "# Turn off storing gradients\n",
+    "with torch.no_grad():\n",
+    "    # Keep iterating until num_tokens are generated\n",
+    "    for tkn_idx in tqdm(range(num_tokens)):\n",
+    "        # Forward pass through the model\n",
+    "        output = model(torch.IntTensor(input_tokens))\n",
+    "        # Get output logits\n",
+    "        logits = output.logits[-1, :]\n",
+    "        # Convert into probabilities\n",
+    "        probs = nn.functional.softmax(logits, dim=-1)\n",
+    "        # Get the index of top token\n",
+    "        top_token = torch.argmax(probs).item()\n",
+    "        # Append the token into the input sequence\n",
+    "        input_tokens.append(top_token)\n",
+    "\n",
+    "# Decode and print the generated text\n",
+    "# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
+    "decoded_text = tokenizer.decode(input_tokens)\n",
+    "print(f'Generated text: \\n{decoded_text}')"
    ]
   },
   {
@@ -1468,7 +1523,7 @@
     "execution": {}
    },
    "source": [
-    "Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:"
+    "We can also directly generate text using the generation_pipeline:"
    ]
   },
   {