diff --git a/README.md b/README.md index c0deea8ca..7c1452ad3 100644 --- a/README.md +++ b/README.md @@ -42,13 +42,6 @@ For the latest stable release, please see the [releases page](https://github.com ### Requirements NeMo-Aligner has the same requirements as the [NeMo Toolkit Requirements](https://github.com/NVIDIA/NeMo#requirements) with the addition of [PyTriton](https://github.com/triton-inference-server/pytriton). -### Quick start inside NeMo container -NeMo Aligner comes included with NeMo containers. On a machine with NVIDIA GPUs and drivers installed run NeMo container: -```bash -docker run --gpus all -it --rm --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.07 -``` -Once you are inside the container, NeMo-Aligner is already installed and together with NeMo and other tools can be found under ```/opt/``` folder. - ### Install NeMo-Aligner Please follow the same steps as outlined in the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation). After installing NeMo, execute the following additional command: ```bash diff --git a/docs/user-guide/dpo.rst b/docs/user-guide/dpo.rst index d5c39a814..2cd054d32 100644 --- a/docs/user-guide/dpo.rst +++ b/docs/user-guide/dpo.rst @@ -5,28 +5,25 @@ Model Alignment by DPO, RPO, and IPO @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase. -All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length `__. The same tutorial also works for GPT models (such as LLaMa3) of any size. +All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length `__. The same tutorial also works for GPT models (such as LLaMa2) of any size. DPO with LoRA ############# We support both full-parameter DPO training and LoRA DPO training. -In full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see the `DPO paper `__). +For full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see `DPO paper `__). For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights. RPO and IPO Variations ####################### -Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO). 
+Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO). The algorithm is identified with the ``dpo.preference_loss`` config variable. We support three sorts of RPO algorithms based on the distance metric: ``rpo_sq`` for squared distance, ``rpo_bwd_kl`` for Bernoulli backward KL divergence, and ``rpo_fwd_kl`` for Bernoulli forward KL divergence. -To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not existent in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used. +To use the RPO algorithm, each dataset example should have chosen_reward and rejected_reward, which might come from human labelers or reward models. If chosen_reward and rejected_reward are not existent in the data, dpo.default_chosen_reward and dpo.default_rejected_reward are used. Obtain a Pretrained Model ############################ @@ -39,18 +36,18 @@ To start, we must first get a pretrained model to align. There are two models we #. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``. #. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``. - #. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here `__. + #. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here `__. .. code-block:: bash python convert_nemo_gpt_to_mcore.py \ --in-folder ./model_checkpoint \ --out-file ./mcore_gpt.nemo - .. tab-item:: LLaMa3 7B + .. tab-item:: LLaMa2 7B :sync: key2 - #. Download the `Llama 3 8B LLM model and tokenizer `__ into the models folder. - #. Convert the LLaMa3 LLM into ``.nemo`` format. + #. Download the `Llama 2 7B LLM model and tokenizer `__ into the models folder. + #. Convert the LLaMa2 LLM into ``.nemo`` format. .. code-block:: bash python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \ @@ -81,7 +78,7 @@ For best DPO training performance, it is recommended that you start with a SFT m DPO Model Training ##################### -Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below:: +Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. 
DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below:: {"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question."} {"prompt": "Please give me the name of a famous medieval painter.", "chosen_response": "Hieronymus Bosch", "rejected_response": "David Hockney"} @@ -91,12 +88,12 @@ However, please be aware that most Megatron GPT models adhere to a strict format {"prompt": "System\n\nUser\nWhich year was the Magna Carta signed?\nAssistant\n", "chosen_response": "1215\n", "rejected_response": "I refuse to answer this question.\n"} {"prompt": "System\n\nUser\nPlease give me the name of a famous medieval painter.\nAssistant\n", "chosen_response": "Hieronymus Bosch\n", "rejected_response": "David Hockney\n"} -Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one ``.jsonl`` file in the format above for your training data and one ``.jsonl`` for your validation data. +Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one jsonl file in the format above for your training data and one jsonl for your validation data. Your JSONL file must contain at least as many samples as the Global Batch Size (GBS) you plan to use during training. For example, if GBS = 64, ensure that both your training and validation files include at least 64 samples. Using a file with fewer samples than the GBS will result in a crash. Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model. -For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``. +For the purposes of the following sections, we assume that your training jsonl file is located in ``/path/to/train_dpo_format.jsonl`` and your validation jsonl file is located in ``/path/to/valid_dpo_format.jsonl``. For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` corresponds to the beta parameter in the DPO paper. @@ -199,7 +196,7 @@ All metrics will be grouped by either ``train/`` or ``val/`` in WandB, represent When it comes to ideal hyperparameters for DPO training, much will depend on the characteristics of your SFT or base/foundation model. Consequently, there are no one-size-fits-all parameters that will universally work in all cases. However, the following list is a brief overview of which hyperparameters we have perturbed for various model sizes and their effects: -* global_batch_size: Generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained. -* epochs: Highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs. -* learning rate: We tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7. 
-* ref_policy_kl_penalty: We generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too. \ No newline at end of file +* global_batch_size: generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained. +* epochs: highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs. +* learning rate: we tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7. +* ref_policy_kl_penalty: we generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too. diff --git a/docs/user-guide/draftp.rst b/docs/user-guide/draftp.rst index d4e2fc9fc..b6fdb8d3c 100644 --- a/docs/user-guide/draftp.rst +++ b/docs/user-guide/draftp.rst @@ -2,20 +2,18 @@ .. _model-aligner-draftp: -Fine-Tuning Stable Diffusion with DRaFT+ +Fine-tuning Stable Diffusion with DRaFT+ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - -In this tutorial, we will go through the step-by-step guide for fine-tuning a Stable Diffusion model using DRaFT+ algorithm by NVIDIA. -DRaFT+ enhances the DRaFT `DRaFT `__ algorithm by mitigating mode collapse and improving diversity through regularization. +In this tutorial, we will go through the step-by-step guide for fine-tuning Stable Diffusion model using DRaFT+ algorithm by NVIDIA. +DRaFT+ is an improvement over the `DRaFT `__ algorithm by alleviating the mode collapse and improving diversity through regularization. For more technical details on the DRaFT+ algorithm, check out our technical blog. -Data Input for Running DRaFT+ + +Data Input for running DRaFT+ ############################# -The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tar file from a ``.txt`` +The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tarfile from a ``.txt`` file containing the prompts separated by new lines, such as following format:: prompt1 @@ -37,7 +35,7 @@ Use the following script to download and save the prompts from the `Pick a pic < for caption in captions: file.write(caption + '\n') -You can then run the following snippet to convert it to a ``.tar`` file: +You can then run the following snipet to convert it to a ``.tar`` file: .. code-block:: bash @@ -66,8 +64,8 @@ you can use the `conversion script `__ and -`VAE `__ components of a trained Stable Diffusion model, as well as a checkpoint for the Reward Model. +To launch reward model training, you must have checkpoints for `UNet `__ and +`VAE `__ of a trained Stable Diffusion model and a checkpoint for the Reward Model. .. tab-set:: @@ -169,7 +167,7 @@ To start reward model training, you need checkpoints for both the `UNet `__ scripts from the NeMo codebase. The generated images with the fine-tuned model should have better prompt alignment and aesthetic quality. 
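The prompt-packaging step described earlier in this guide (turning a newline-separated ``.txt`` prompt file into a ``.tar`` file) can be illustrated with a short sketch. This is a minimal, hypothetical illustration only; the conversion snippet shipped with NeMo-Aligner may rely on webdataset tooling and a different member naming scheme, and the ``prompts.txt``/``prompts.tar`` paths are placeholders.

.. code-block:: python

    import io
    import tarfile

    input_txt = "prompts.txt"   # placeholder: newline-separated prompt file
    output_tar = "prompts.tar"  # placeholder: tar file consumed by DRaFT+ training

    with open(input_txt, "r", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]

    with tarfile.open(output_tar, "w") as tar:
        for idx, prompt in enumerate(prompts):
            data = prompt.encode("utf-8")
            # One plain-text member per prompt, using a zero-padded,
            # webdataset-style name such as 00000.txt, 00001.txt, ...
            member = tarfile.TarInfo(name=f"{idx:05d}.txt")
            member.size = len(data)
            tar.addfile(member, io.BytesIO(data))

Each member holds one plain prompt, which matches the "plain prompt" tar layout the data-input section describes.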
-User-controllable Fine-Tuning with Annealed Importance Guidance (AIG) -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +User controllable finetuning with Annealed Importance Guidance (AIG) +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and a DRaFT+ fine-tuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated ``weight_type`` strategies to interpolate between the base and fine-tuned model. +AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and DRaFT-finetuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated `weight_type` strategies to interpolate between the base and finetuned model. .. tab-set:: .. tab-item:: AIG on Stable Diffusion XL diff --git a/docs/user-guide/modelalignment.rsts b/docs/user-guide/modelalignment.rsts index 450c2d644..6b70f7a18 100644 --- a/docs/user-guide/modelalignment.rsts +++ b/docs/user-guide/modelalignment.rsts @@ -1,34 +1,2 @@ - -.. _model-aligner-intro: - Model Alignment !!!!!!!!!!!!!!! - -Introduction -############ - -NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state-of-the-art model alignment algorithms such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless, and helpful. Users can perform end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner. For more technical details, please refer to our `paper `__. - -The NeMo-Aligner toolkit is built using the `NeMo Toolkit `__ which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization. - -The toolkit is currently in its early stages. We are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful, and reliable models. - -Get Started -########### - -NeMo-Aligner comes preinstalled in NVIDIA NeMo containers. NeMo containers are launched concurrently with NeMo version updates. - -To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: `NVIDIA NGC `__. Once you have logged in, you can get the container here: `NVIDIA NGC NeMo Framework `__. - -To use a pre-built container, run the following code: - - .. code-block:: bash - - docker run -it --gpus=all --shm-size=8g --workdir /opt/NeMo-Aligner nvcr.io/nvidia/nemo:24.09 - - Please use the latest tag in the form yy.mm.(patch). - -.. note:: - Some of the subsequent tutorials require accessing gated Hugging Face models. For details on how to access these models, refer to ``this document ``__. 
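The note above mentions that some tutorials require gated Hugging Face models. As a hedged sketch, assuming the ``huggingface_hub`` package is available inside the container and your account has been granted access to the model, the checkpoint can be fetched programmatically; the ``huggingface-cli login`` and ``huggingface-cli download`` commands used elsewhere in these guides are an equivalent alternative.

.. code-block:: python

    from huggingface_hub import login, snapshot_download

    # Authenticate with a Hugging Face access token that has been granted
    # access to the gated repository (the token below is a placeholder).
    login(token="hf_your_access_token_here")

    # Download the gated checkpoint and tokenizer into a local folder.
    snapshot_download(
        repo_id="meta-llama/Meta-Llama-3-8B",
        local_dir="Meta-Llama-3-8B",
    )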
- - diff --git a/docs/user-guide/rlhf.rst b/docs/user-guide/rlhf.rst index dd31815f9..48906a79f 100644 --- a/docs/user-guide/rlhf.rst +++ b/docs/user-guide/rlhf.rst @@ -5,17 +5,14 @@ Model Alignment by RLHF @@@@@@@@@@@@@@@@@@@@@@@ -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - -For the purposes of this tutorial, we will go through the entire Reinforcement Learning from Human Feedback (RLHF) pipeline using models from the NeMo Framework. These models can include LLaMa or Mistral, and our scripts will function consistently across them. +For the purposes of this tutorial, we will go through the entire Reinforcement Learning from Human Feedback (RLHF) pipeline using models from the NeMo Framework. These models can include LLaMa2 or Mistral, and our scripts will function consistently across them. RLHF is usually preceded by a Supervised Fine-Tuning (SFT). We should first follow the :ref:`Prerequisite guide ` and the :ref:`SFT guide `. After obtaining the SFT model, we will use this to start the RLHF process. We will use the `PPO `__ algorithm for reinforcement learning on the `Anthropic-HH-RLHF `__ dataset. Data Processing for RLHF ######################### -We have a script ready to use for processing the Anthropic-HH dataset into a JSONL format. Run the following command on the `download_and_process.py `__ script for anthropic HH. +We have a script ready to use for processing the Anthropic-HH dataset into a jsonlines format. Run the following command on the `download_and_process.py `__ script for anthropic HH. .. code-block:: bash @@ -143,13 +140,11 @@ To launch reward model training, you must start with a pretrained or SFT-trained set +x -.. note:: - Currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release. - +*Remark: Currently, the example training script does not automatically run evaluation on the provided test set. This may change in a future release.* During reward model training, it’s expected that the validation accuracy improves as the training progresses. In the example provided above using Slurm, we achieved a validation accuracy of 69.57%. -Upon completing the training, NeMo-Aligner will save a ``megatron_gpt.nemo`` file, which serves as the reward model needed for the RL stage. +With the finished training, NeMo-Aligner will save a ``megatron_gpt.nemo`` which is the reward model we need for the RL stage. PPO Training ############ @@ -159,16 +154,16 @@ After you have fine-tuned a GPT model using SFT and trained a reward model as ex During PPO training, we conceptually have four models that interact with each other: #. The PPO Actor Network (also known as the Policy Network): This is the model we are training. It should start from an SFT model. -#. The Reward Model (RM) Network (also known as a Preference Model): This model takes a prompt concatenated with a response as input and outputs a single scalar value, the reward, which the PPO algorithm will try to maximize. +#. The Reward Model (RM) Network (also known as a Preference Model (PM)): This model takes a prompt concatenated with a response as input and outputs a single scalar value, the reward, which the PPO algorithm will try to maximize. #. The PPO Critic Network (also known as the Value Network): Since PPO is an Actor-Critic algorithm, we need a Critic to guide the Actor during training. 
The Critic will provide value estimates for each token in the responses provided by the Actor. These values can be seen as an estimate of the total reward the Actor will receive after generating all the remaining tokens. The Critic should be initialized from the RM so as to provide useful feedback in the early stages of training. Note: The RM generates a single reward for the entire sequence, whereas the Critic generates a value for each token. #. The Initial Policy Network (also known as the Reference Model): We use this model to compute a KL Divergence penalty term that ensures that the PPO Actor does not diverge too much from the Initial Policy. This way, we prevent the PPO Actor from overfitting to the rewards given by the RM, and ensure it does not forget the knowledge it acquired during pretraining and SFT. This model should be the one used to initialize the PPO Actor Network. -In the most optimized configuration, NeMo-Aligner will run the Actor and initial policy within the same job, as well as the Critic and reward model within the same job. It will then use CPU offloading to load back the corresponding model when needed. +In the most optimized configuration, NeMo-Aligner will run the Actor and initial policy within the same job as well as the Critic and reward model within the same job. It will then use CPU offloading to load back the corresponding model when needed. The next section discusses how to launch each of these two jobs. -Launch the Reward Model and Critic Server -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Launching the Reward Model and Critic Server +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To launch the server: @@ -203,8 +198,8 @@ To launch the server: The above example launches the reward model Critic server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices``, ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. In addition, make sure to tune the `trainer.ppo.inference_micro_batch_size` argument as this determines the batch size the PPO Actor is allowed to send to the Critic per DP rank. -Launch the Initial Policy and PPO Actor Training -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Launching the Initial Policy and PPO Actor Training +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The PPO Actor training job contains the master controller that makes the HTTP calls to all servers when needed. To launch the PPO Actor and Initial Policy server: @@ -245,8 +240,8 @@ The above script launches the initial and Actor server on 1 node with 8 GPUs. .. note:: For more info on PPO hyperparameters, see `PPO Hparams `__. -Launch Both Servers for RLHF Training -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Launching Both Servers for RLHF Training +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% You can use Slurm to launch both jobs and coordinate them together in a full RLHF job using the following script: @@ -367,8 +362,8 @@ It is important to launch all jobs with ``&`` after the srun command, to ensure .. note:: Make sure to change the Critic arg ``trainer.ppo.inference_micro_batch_size`` such that ``trainer.ppo.inference_micro_batch_size * DP size <= model.ppo.rollout_micro_batch_size``. -Speed up PPO with TensorRT-LLM -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Speeding up PPO with TensorRT-LLM +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% NeMo-Aligner has support for accelerating RLHF with `TensorRT-LLM `__. This can provide a significant speedup to the PPO training time when enabled. 
There are a few crucial flags to set when using TensorRT-LLM with Aligner. #. `trainer.ppo.trt_llm.enable=True` enables TensorRT-LLM. @@ -377,7 +372,7 @@ NeMo-Aligner has support for accelerating RLHF with `TensorRT-LLM `__. +For more information please see the aligner `paper `__. PPO Results with TensorRT-LLM %%%%%%%%%%%%%%%%%%%%%%%%%%%%% @@ -401,9 +396,7 @@ We test the scaling of our TRT-LLM integration by running Llama3 70B Actor and L | 64 | 32 | 334 | 56.9 | 6.44 | +------------------+-------------------+-----------------------------+----------------------+--------------------+ -.. note:: - for 64x32 config we used a ``rollout_micro_batch_size`` of 16 instead of 8 due to the additional memory from the the distributed optimizer. - +NOTE: for 64x32 config we used a rollout_micro_batch_size of 16 instead of 8 since we have more memory coming from the distributed optimizer. We also support running RLHF on Llama3.1 405B Actor and Reward Model. The following numbers are generated with ``num_rollout_samples=128``, ``global_batch_size=128``, reshard turned off, engine offloading set to False. @@ -413,14 +406,14 @@ We also support running RLHF on Llama3.1 405B Actor and Reward Model. The follow | 84 | 42 | 915.6 | 164.6 | +------------------+-------------------+----------------------------+--------------------+ -In the future, we aim to improve the performance of generation with large models that have high pipeline parallelism size. +In the future we aim to improve the performance of generation with large models that have high pipeline parallelism size. PPO Results %%%%%%%%%%% Once you've completed RLHF training, you can serve your model using the `megatron_gpt_eval.py `__ script from the NeMo codebase to run more rigorous evaluation of your trained model. -Scale the Tutorial to Bigger Models -################################### +Scaling the Tutorial to Bigger Models +##################################### While the tutorial above provides a way to get started with RLHF, it doesn’t represent the most optimal performance or convergence configuration. When running RLHF fully, we anticipate achieving an MT-bench score improvement of approximately +0.4 to +0.5. It’s essential to begin with a high-quality SFT model and closely monitor the response length. diff --git a/docs/user-guide/rs.rst b/docs/user-guide/rs.rst index 44862eeaa..ac7ea30ee 100644 --- a/docs/user-guide/rs.rst +++ b/docs/user-guide/rs.rst @@ -5,27 +5,24 @@ Model Alignment by Rejection Sampling @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - -In this tutorial, we will guide you through the process of aligning a NeMo Framework model using rejection sampling. This method can be applied to various models, including LLaMa and Mistral, with our scripts functioning consistently across different models. +In this tutorial, we will guide you through the process of aligning a NeMo Framework model using rejection sampling. This method can be applied to various models, including LLaMa2 and Mistral, with our scripts functioning consistently across different models. Rejection Sampling is usually preceded by a Supervised Fine-Tuning (SFT). We should first follow the :ref:`Prerequisite guide ` and the :ref:`SFT guide `. After obtaining the SFT model, we will also need to train a reward model as in :ref:`PPO guide `. We will use the rejection sampling algorithm on the `Anthropic-HH-RLHF `__ dataset. 
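Before walking through the training commands, the selection step at the heart of rejection sampling can be summarized with a short, framework-agnostic sketch. The ``generate`` and ``score`` callables below are placeholders standing in for the policy and the reward-model server (which the sections below launch as separate jobs), and the default values mirror the ``model.rs.num_rollouts_per_prompt`` and ``model.rs.top_n_rollouts`` settings used later in this guide.

.. code-block:: python

    from typing import Callable, List, Tuple

    def select_rollouts(
        prompt: str,
        generate: Callable[[str], str],      # placeholder: sample a response from the policy
        score: Callable[[str, str], float],  # placeholder: reward model score for (prompt, response)
        num_rollouts_per_prompt: int = 8,
        top_n_rollouts: int = 1,
    ) -> List[Tuple[str, float]]:
        """Sample several responses per prompt, score them with the reward
        model, and keep only the highest-reward ones for further fine-tuning."""
        candidates = [generate(prompt) for _ in range(num_rollouts_per_prompt)]
        scored = [(response, score(prompt, response)) for response in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n_rollouts]

The retained high-reward responses are then used to fine-tune the policy, which is how the reward is maximized in this setup.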
Rejection Sampling Training -########################### +############ -After you have fine-tuned a GPT model using SFT and trained a reward model as explained in the preceding section, you can start aligning the policy using rejection sampling. +After you have fine-tuned a GPT model using Supervised Fine-Tuning (SFT), and trained a reward model as explained in the preceding section, you can start aligning the policy using rejection sampling. -During rejection sampling training, we have two models interacting with each other, which NeMo-Aligner runs in separate jobs: +During rejection sampling training, we have two models interacting with each other, which Aligner runs in separate jobs: #. The Policy Network: This is the model we are training and it should start from an SFT model. #. The Reward Model (RM): This model accepts a prompt combined with a response as input and produces a single scalar value, known as the reward. The rejection sampling algorithm aims to maximize this reward. The next section discusses how to launch each of these two jobs. -Launch the Reward Model and Critic Server -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Launching the Reward Model and Critic Server +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To launch the server: @@ -46,12 +43,12 @@ To launch the server: rm_model_file=${RM_NEMO_FILE} -The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change ``trainer.devices`` and ``trainer.num_nodes`` depending on your model size and scale. NeMo-Aligner will work on any scale. Also, make sure to tune the ``trainer.rs.inference_micro_batch_size`` argument. This argument sets the size of the batch the RS actor is allowed to send to the critic per DP rank. +The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.rs.inference_micro_batch_size argument. This argument sets the size of the batch the RS actor is allowed to send to the critic per DP rank. Launch the Initial Policy and RS Actor Training %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -The RS actor training job contains the master controller that makes the HTTP calls to all servers when needed. To launch the RS actor and initial policy server: +The RS Actor training job contains the master controller that makes the HTTP calls to all servers when needed. To launch the RS Actor and Initial Policy server: .. code-block:: bash @@ -98,12 +95,12 @@ The RS actor training job contains the master controller that makes the HTTP cal model.rs.num_rollouts_per_prompt=8 \ model.rs.top_n_rollouts=1 -The above command launches the RS actor and initial policy server on 1 node with 8 GPUs. +The above command launches the initial and actor server on 1 node with 8 GPUs. -Launch Both Servers for Rejection Sampling Training -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +Launching Both Servers for Rejection Sampling training +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -You can use Slurm to launch the two jobs and get them to coordinate together in a full rejection sampling job via the following: +You can use slurm to launch the 2 jobs and get them to coordinate together in a full Rejection Sampling job via the following: .. 
code-block:: bash @@ -220,7 +217,7 @@ You can use Slurm to launch the two jobs and get them to coordinate together in wait -The above script runs the reward model server on 1 node and the RS actor on 1 node. +The above script runs the reward model server on 1 node and the actor on 1 node. It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other. diff --git a/docs/user-guide/sft.rst b/docs/user-guide/sft.rst index d6beed8d6..0bed1703a 100644 --- a/docs/user-guide/sft.rst +++ b/docs/user-guide/sft.rst @@ -7,7 +7,7 @@ Obtain a Pretrained Model The NeMo Framework supports efficient model alignment using the NeMo-Aligner codebase. All algorithms in NeMo-Aligner will work with any NeMo GPT-based model. To see a collection of scripts that convert popular models from Hugging Face to ``.nemo`` format, go `here `__. -To get started, you need to obtain a pretrained model to align. Three models are recommended: 2B GPT, LLama3-8B, or Nemotron-340B. For demonstration purposes, the smaller 2B model will be used, but you can follow the rest of the tutorial with any of these three models. +To get started, you need to obtain a pretrained model to align. Three models are recommended: 2B GPT, LLama2-7B, or Nemotron-340B. For demonstration purposes, the smaller 2B model will be used, but you can follow the rest of the tutorial with either model. .. tab-set:: @@ -19,18 +19,15 @@ To get started, you need to obtain a pretrained model to align. Three models are 3. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here `__. .. code-block:: bash - python /opt/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py \ + python convert_gpt_nemo_to_mcore.py \ --input_name_or_path ./model_checkpoint \ --output_path ./mcore_gpt.nemo - .. tab-item:: LLaMa3-8B + .. tab-item:: LLaMa2-7B :sync: key2 - 1. Download the `Llama3-8B LLM model and tokenizer `__ into the model's folder. You can use the Hugging Face CLI for this: - .. code-block:: bash - huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir Meta-Llama-3-8B - - 2. Convert the LLaMa3 LLM into ``.nemo`` format. + 1. Download the `Llama2-7B LLM model and tokenizer `__ into the model's folder. + 2. Convert the LLaMa2 LLM into ``.nemo`` format. .. code-block:: bash python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \ @@ -41,13 +38,12 @@ To get started, you need to obtain a pretrained model to align. Three models are 1. Download the model from `Hugging Face `__. 2. For all scripts, point ``*.restore_from_path`` to the directory where you have downloaded the files. - .. note:: - Because of the 340B's size, it is recommended that you use TP8 PP24 which will be safe for algorithms in NeMo-Aligner. + Note: Because of the 340B's size, it is recommended that you use TP8 PP24 which will be safe for algorithms in Aligner. After these steps, you will have a file called ``mcore_gpt.nemo`` to use in NeMo-Aligner. .. note:: - If you bring your own .nemo model, make sure to change the `model.encoder_seq_length` in the NeMo-Aligner configs to match the sequence length of your own model. + If you bring your own .nemo model, make sure to change the `model.encoder_seq_length` in the Aligner configs to match the sequence length of your own model. .. note:: When working with Megatron Core models, which utilize the Transformer engine as a backend, the system attempts to find efficient kernels. 
However, depending on your GPU, it may not always locate them. If you encounter errors related to kernel finding, consider setting these variables at the top of your script. @@ -65,18 +61,15 @@ Model Alignment by Supervised Fine-Tuning (SFT) **SFT** is the process of fine-tuning a model's parameters on supervised data of inputs and outputs. It teaches the model how to follow user-specified instructions. It is typically done after model pre-training. It is also an important prerequisite step in Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Nemo-Aligner supports two types of SFT formats: -1. **Prompt-Response**. In the *Prompt-Response* format, each example contains an input prompt and the annotated response. SFT fine-tunes the base model to follow the prompt instruction and answer in the style of the annotated responses. The Prompt-Response format can be used in various problems like Question Answering (Q&A) and Summarization. +1. **Prompt-Response**. In the *Prompt-Response* format, each example contains an input prompt and the annotated response. SFT fine-tunes the base model to follow the prompt instruction and answer in the style of the annotated responses. The prompt-response format can be used in various problems like Question Answering (Q&A) and Summarization. 2. **Chat**. In the *Chat* format, each example contains a multi-turn conversation between different roles (e.g., *User* and *Assistant*). Fine-tuning the base model on a chat format dataset is useful to align a chatbot. -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - Fine-Tune with a Prompt-Response Dataset %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -Step 1: Format the data. -^^^^^^^^^^^^^^^^^^^^^^^^ +Step 1: Format the data +^^^^^^^^^^^^^^^^^^^^^^^ This example uses the `Dolly dataset `__ to demonstrate how to format your SFT data. This dataset consists of 15,000 instruction-context-response triples. @@ -112,7 +105,7 @@ This approach eliminates the need for padding and improves GPU utilization. Refe NeMo provides a script to pack your SFT prompt-response dataset. Refer to the `prepare dataset `_ section of the documentation for details on how to use this script. -Step 2: Run SFT training. +Step 2: Run SFT training ^^^^^^^^^^^^^^^^^^^^^^^^^ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. @@ -152,7 +145,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. exp_manager.resume_if_exists=True \ exp_manager.resume_ignore_no_checkpoint=True \ exp_manager.create_checkpoint_callback=True \ - exp_manager.checkpoint_callback_params.monitor=val_loss + exp_manager.checkpoint_callback_params.monitor=validation_loss .. tab-item:: Slurm :sync: key4 @@ -224,7 +217,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. exp_manager.resume_ignore_no_checkpoint=True \ exp_manager.create_checkpoint_callback=True \ exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \ - exp_manager.checkpoint_callback_params.monitor=val_loss + exp_manager.checkpoint_callback_params.monitor=validation_loss EOF srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" @@ -233,26 +226,21 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. If using sequence packing, replace the data paths with the paths to your packed datasets. 
For each packed dataset, you should also set ``packed_sequence=True`` in the config: .. code-block:: python - +model.data.train_ds.packed_sequence=True \ +model.data.validation_ds.packed_sequence=True It is not required to pack both the train and validation datasets. If packing only the train dataset, exclude ``+model.data.validation_ds.packed_sequence=True``. -To scale to thousands of GPUs, adjust the ``trainer.num_nodes`` and ``trainer.devices`` accordingly based on the size of your machine. If you are running with a larger model, you may need to -change the parallelism. If you run out of memory with Llama3-8b, add tensor parallelism to your config: - -.. code-block:: bash - model.tensor_model_parallel_size=2 \ +To scale to thousands of GPUs, adjust the ``trainer.num_nodes`` and ``trainer.devices`` accordingly based on the size of your machine. For this particular run on the 2B model, the final training loss is approximately 1.536. Once the training finishes, you’ll find a file called ``megatron_gpt_sft.nemo`` available for use. .. note:: - NeMo Framework supports WandB logging. To get started with WandB, see the `Quick Start Guide `__. You can enable WandB logging with ``exp_manager.create_wandb_logger=True`` and it will log the job results to WandB. + NeMo Framework supports wandb logging. To get started with wandb, see the `Quick Start Guide `__. You can enable wandb logging with ``exp_manager.create_wandb_logger=True`` and it will log the job results to wandb. The provided Slurm scripts rely on the `pyxis `__ Slurm extension, which requires specifying the ``--container-image=`` ``--container-mounts=``. However, it’s important to note that NeMo-Aligner can also function in regular Python environments without this extension. -Step 3: Run inference or further fine-tuning. +Step 3: Run inference or further fine-tuning ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given the trained SFT model, you can run inference on new examples or fine-tune the SFT model to boost the performance (e.g., RLHF or DPO). It is important to note that their inputs need to follow the **Prompt Template** used in this model. The template is set by ``data.train_ds.prompt_template``. The saved NeMo model, ``megatron_gpt_sft.nemo``, also stores the prompt format. You can ``tar -xvf megatron_gpt_sft.nemo`` and find it in `model_config.yaml`. @@ -262,8 +250,8 @@ In this example, the template is ``"{input} {output}"``. Fine-Tune with a Chat Dataset %%%%%%%%%%%%%%%%%%%%%%%%%%%%% -Step 1: Format the data. -^^^^^^^^^^^^^^^^^^^^^^^^ +Step 1: Format the data +^^^^^^^^^^^^^^^^^^^^^^^ In this example, you use the `OpenAssistant dataset `__. Download and convert the dataset into the chat format by using the following script: @@ -272,7 +260,7 @@ In this example, you use the `OpenAssistant dataset `__. This same tutorial also works for GPT models (such as LLaMa3) of any size. +The NeMo framework supports efficient model alignment via the NeMo Aligner codebase. -For details on the SPIN algorithm, refer to the paper: `https://arxiv.org/abs/2401.01335 `__. +All algorithms in NeMo Aligner will work with any GPT based model that is from mcore(i.e in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire SPIN pipeline using the newly released `2B GPT model with 4096 sequence length `__. This same tutorial also works for GPT models(such as LLaMa2) of any size. -.. 
note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. - -Obtain a Pretrained Model -######################### -To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model. +Obtaining a pretrained model +############################ +To start, we must first get a pretrained model to align. There are 2 models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes we will use the smaller 2B model. .. tab-set:: .. tab-item:: 2B GPT :sync: key1 - #. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``. - #. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``. - #. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here `__. + #. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo`` + #. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint`` + #. And then run the script to convert from old NeMo checkpoint to Megatron-Core checkpoint. The script is located `here `__. .. code-block:: bash python convert_nemo_gpt_to_mcore.py \ --in-folder ./model_checkpoint \ --out-file ./mcore_gpt.nemo - .. tab-item:: LLaMa3 8B + .. tab-item:: LLaMa2 7B :sync: key2 - #. Download the `Llama 3 8B LLM model and tokenizer `__ into the models folder. - #. Convert the LLaMa3 LLM into ``.nemo`` format. + #. Download the `Llama 2 7B LLM model and tokenizer `__ into the models folder. + #. Convert the LLaMa2 LLM into ``.nemo`` format .. code-block:: bash python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \ @@ -43,7 +40,7 @@ To start, we must first get a pretrained model to align. There are two models we After these steps you should have a file ``mcore_gpt.nemo`` to use in NeMo-Aligner. .. note:: - Megatron Core models use TransformerEngine as a backend, which attempts to find efficient kernels. However, depending on your GPU, it may not always succeed. If you encounter errors related to kernel finding, set these variables at the top of your script. + Mcore models use TransformerEngine as a backend, and it tries to find efficient kernels. But depending on the GPU you have it may not find them. If you ever face errors that relate to kernel finding set these variables on top of your script. .. code-block:: bash @@ -58,27 +55,27 @@ Helpfully, TransformerEngine exposes a flag to set if you want to guarantee dete export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 -SFT vs. Foundational (Base) Model for SPIN Training -################################################### -Unlike DPO and PPO, SPIN was designed to run on foundational (base) models—models trained only on autoregressive language prediction tasks and not on instruction-following tasks. -However, you can also run SPIN on models that have been SFTed on instruction-based datasets, similar to DPO/PPO. Both types of models will work well with SPIN. If you prefer to start with a supervised fine-tuned model instead of a base model, please see our full guide on how to perform SFT on a Megatron GPT model :ref:`SFT guide `. 
+SFT vs Foundational (base) model for SPIN Training +################################################## +Unlike DPO and PPO, SPIN was designed to be run on foundational (base) models, that is, models which have only been trained on autoregressive language prediction tasks and not on instruction following tasks. +However, you can also run SPIN on models which have been SFTed on instruction-based datasets as well, similar to DPO/PPO. Either type of model will work well with SPIN. If you would like to start with a supervised fine tuned model instead of a base model, please see our full guide on how to perform SFT on a Megatron GPT model :ref:`SFT guide `. SPIN Model Training ################### SPIN training uses the exact same dataset formatting and files as the NeMo-Aligner SFT trainer. Please see the data formatting section of SFT to understand the data format necessary for SPIN :ref:`SFT guide ` -Once your data is processed into the correct format, you are ready to begin SPIN training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the SPIN model. -For the purposes of the following sections, we'll assume your training .jsonl file is located in ``/path/to/train_spin_format.jsonl`` and your validation .jsonl file is located in ``/path/to/valid_spin_format.jsonl``. +Once your data is processed into the correct format you are ready to begin SPIN training. You must start with a pretrained or SFT trained model. For this section we will use the SFT model trained in the previous step to train the SPIN model. +For the purposes of the following sections, we'll assume your training jsonl file is located in ``/path/to/train_spin_format.jsonl`` and your validation jsonl file is located in ``/path/to/valid_spin_format.jsonl``. -For the following parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds to the beta parameter in the SPIN paper and ``trainer.spin.max_iterations`` corresponds to T (with ``trainer.spin.max_epochs`` epochs per iteration). +For the below parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds to the beta parameter in the SPIN paper, and ``trainer.spin.max_iterations`` corresponds to T (with ``trainer.spin.max_epochs`` epochs per iteration) .. tab-set:: .. tab-item:: Terminal :sync: key3 - To run SPIN model training on the terminal directly: + To run SPIN model training on the terminal directly .. code-block:: bash @@ -107,7 +104,7 @@ For the following parameters, the ``model.spin.ref_policy_kl_penalty`` correspon .. tab-item:: Slurm :sync: key4 - The following script uses 4 nodes, but you can change the node count to something different. To run SPIN model training using Slurm: + To run SPIN model training using Slurm. The script below uses 4 nodes, but you can change the node count to something different. .. code-block:: bash @@ -168,20 +165,18 @@ For the following parameters, the ``model.spin.ref_policy_kl_penalty`` correspon srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}" set +x -During SPIN training, several metrics will be recorded to WandB for you to monitor, chiefly acc (representing the percentage by which the model's implicit reward for the ground truth response exceeds that of the response generated by the reference policy). 
+During SPIN training, there will be several metrics recorded to WandB which you can monitor, chiefly acc (representing the percentage amount whereby the model's implicit reward for the ground truth response is greater than for the response generated by the reference policy). The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses. During training, the acc should generally be increasing, but don't worry if its absolute value remains low, as it doesn't correlate to finalised MTBench or MMLU scores. It should just be generally increasing. Other metrics to keep an eye on are the rewards_actual_mean and rewards_generated_mean, which represent the average of the ``rewards`` as defined above. Again, the absolute values aren't necessarily so important as the fact that the actual_mean should be greater than the generated_mean over time, and the greater that difference, the better. All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively. - -.. note:: - For validation, we calculate only a vanilla SFT negative log-likelihood loss instead of using the formal SPIN loss. As a result, validation metrics will include only the SFT NLL loss. This approach speeds up the validation process, as performing SPIN generation is time-consuming and not strictly necessary for validation. +NOTE: for validation we only calculate a vanilla SFT negative log-likelihood loss instead of using the formal SPIN loss, so for validation metrics there will only be the SFT NLL loss. We do this to speed up the validation aspect of training, as doing SPIN generation is time consuming, and not really necessary for validation. When it comes to ideal hyperparameters for SPIN training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases. However, the following is a brief overview of which hyperparameters we have perturbed for various model sizes and their effects: -* global_batch_size: The SPIN paper recommends a GBS of 64 for a 7B model, which aligns with our findings -- higher GBS for 7B models results in worse performance. For larger models, you can increase the GBS to 128 or 256 as needed, but starting with 64 as a baseline is recommended. -* iterations/epochs: The SPIN paper used iterations=3 and epochs=2 for their training on a 7B model with a training dataset size of 200k. Using the same foundational model as the authors, we found better results with iterations=1 and epochs=1 using a 50k subset of their 200k data. We, therefore, recommend starting with iterations=1 and increasing to 2 as needed by testing on MT-Bench/MMLU. - Additionally, unlike the SPIN paper, our implementation does not currently inject the generated samples from iteration t-1 into iteration t. This may explain why we do not see any performance increases with iterations greater than 1. -* learning rate: The SPIN paper recommends starting with 5e-7 and annealing down to 1e-7 for the final iteration. We found that this generally works well; however, we also saw good results with a constant learning rate of 4e-7 or 3e-7. -* ref_policy_kl_penalty: This is an area of ongoing research. The SPIN paper recommends starting at 0.1 and increasing up to 5.0 for the final iteration. 
We find that a beta of 0.1 works well for the first iteration, but subsequent iterations tend to overfit quickly. Raising the KL penalty seems to help, but not enough for T > 1 checkpoints to perform better than T <= 1. For now, we recommend leaving KL at 0.1 and training for a single iteration only. +* global_batch_size: the SPIN paper recommends 64 for a 7B model, which we have found holds true, in that higher GBS for 7B models performs much worse. For larger models, you can increase to 128 or 256 as needed, but we recommend you start with 64 as a baseline +* iterations/epochs: the SPIN paper used iterations=3 and epochs=2 for their training on a 7B model with a training dataset size of 200k. Using the same foundational model as the authors, we found better results with iterations=1, epochs=1 using a 50k subset of their 200k data. We therefore recommend starting with iterations=1, and increasing to 2 as needed by testing on MT-Bench/MMLU. + additionally, unlike the SPIN paper, our implementation does not currently inject the generated samples from iteration t-1 into t, and this may be a reason why we do not see any performance increases with iterations > 1. +* learning rate: the SPIN paper recommends starting with 5e-7 and annealing down to 1e-7 for the final iteration. We found that this generally works well, however, we also saw good resutls from a constant learning rate of 4e-7 or 3e-7. +* ref_policy_kl_penalty: this is an area of ongoing research. The SPIN paper recommends startings at 0.1 and increasing up to 5.0 for the final iteration. We find that a beta of 0.1 works well for the first iteration, but subsequent iterations tend to overfit quickly, which raising the KL penalty seems to help with, but not enough that T > 1 checkpoints perform better than T <= 1. For now, we recommend leaving KL at 0.1 and training for a single iteration only. diff --git a/docs/user-guide/steerlm.rst b/docs/user-guide/steerlm.rst index 5eda95e2b..635abe5a3 100644 --- a/docs/user-guide/steerlm.rst +++ b/docs/user-guide/steerlm.rst @@ -6,14 +6,11 @@ Model Alignment by SteerLM Method @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -**SteerLM** is a novel approach developed by the NVIDIA NeMo Team, introduced as part of NVIDIA NeMo Alignment methods. It simplifies the customization of large language models (LLMs) and empowers users with dynamic control over model outputs by specifying desired attributes. -Despite remarkable progress in natural language generation driven by LLMs like GPT-3, Megatron-Turing, Chinchilla, PaLM-2, Falcon, and Llama 2, these foundational models often fall short in delivering nuanced and user-aligned responses. -The current approach for LLM improvement combines Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), but it comes with complexities and limited user control. -SteerLM addresses these challenges and represents a significant advancement in the field, making it easier to tailor LLMs to specific needs and preferences. This document delves into how SteerLM operates and offers guidance on training a SteerLM model. +**SteerLM** is a novel approach developed by the NVIDIA NeMo Team, introduced as part of NVIDIA NeMo Alignment methods. It simplifies the customization of large language models (LLMs) and empowers users with dynamic control over model outputs by specifying desired attributes. 
Despite remarkable progress in natural language generation driven by LLMs like GPT-3, Megatron-Turing, Chinchilla, PaLM-2, Falcon, and Llama 2, these foundational models often fall short in delivering nuanced and user-aligned responses. The current approach for LLM improvement combines supervised fine-tuning and reinforcement learning from human feedback, but it comes with complexities and limited user control. SteerLM addresses these challenges and represents a significant advancement in the field, making it easier to tailor LLMs to specific needs and preferences. This document delves into how SteerLM operates and offers guidance on training a SteerLM model. SteerLM ############### -SteerLM leverages a SFT method that empowers you to control responses during inference. It overcomes the limitations of prior alignment techniques, and consists of four key steps: +SteerLM leverages a supervised fine-tuning method that empowers you to control responses during inference. It overcomes the limitations of prior alignment techniques, and consists of four key steps: 1. Train an attribute prediction model on human-annotated datasets to evaluate response quality on any number of attributes like helpfulness, humor, and creativity. @@ -32,87 +29,91 @@ SteerLM simplifies alignment compared to RLHF. It supports user-steerable AI by SteerLM vs RLHF ############### -RLHF and SteerLM are two methods aimed at aligning language models to human preferences. RLHF trains language models by providing positive or negative feedback on generated responses, reinforcing good behaviors. Specifically, the model is encouraged to generate more text similar to responses that receive positive feedback, and less like those with negative feedback. +Reinforcement Learning from Human Feedback (RLHF) and SteerLM are two methods aimed at aligning language models to human preferences. RLHF trains language models by providing positive or negative feedback on generated responses, reinforcing good behaviors. Specifically, the model is encouraged to generate more text similar to responses that receive positive feedback, and less like those with negative feedback. SteerLM takes a different approach to model alignment. Rather than solely reinforcing "good" behaviors, it categorizes the space of possible model responses using steering labels. At inference time, the model generates based on these categorical labels that steer its output. So while RLHF uses direct feedback on model generations, SteerLM aligns by mapping responses into labeled categories associated with human preferences. -The two methods approach model alignment from different angles: RLHF reinforces desired model behaviors directly, while SteerLM steers generation based on categorical labels. Both aim to produce language model outputs that are better aligned with human values and preferences. +The two methods tackle model alignment from different angles - RLHF by directly reinforcing desired model behaviors, and SteerLM by steering generation based on categorical labels. Both aim to produce language model outputs better aligned with human values and preferences. .. note:: - For details on SteerLM, please refer to our paper `SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF `_. - For details about the HelpSteer dataset, please refer to our paper `HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM `_. + For details of SteerLM, please refer to our paper `SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF `_. 
+ For details of HelpSteer dataset, please refer to our paper `HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM `_. Train a SteerLM model ##################### -This section is a step-by-step tutorial that walks you through how to run a full SteerLM pipeline with a Llama2 70B LLM model. +This section is a step-by-step tutorial that walks you through how to run a full SteerLM pipeline with a Llama2 70B LLM model. It includes the following: -.. note:: - Before starting this tutorial, be sure to review the :ref:`introduction ` for tips on setting up your NeMo-Aligner environment. +1. Data download and preprocessing + +2. Training the attribute prediction model (aka regression reward model) -Download the Llama 2 LLM model -^^^^^^^^^^^^^^^^^^^^^^^^^^ +3. Training the attribute-conditioned SFT -#. Download the Llama 2 70B LLM model from HF into the models folder. +4. Inference on the SteerLM model with different attribute values -#. Convert the Llama 2 LLM into .nemo format: - .. code-block:: bash +Step 1: Download Llama 2 LLM model +############################################################# +Download the Llama 2 70B LLM model from HF into the models folder. - mkdir -p /models/llama70b/ - python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /path/to/llama --output_path /models/llama70b/llama70b.nemo +Then convert the Llama 2 LLM into .nemo format: -#. Download and convert to .nemo format for the 13B model . This is needed for the Attribute Prediction Modeling step. +.. code-block:: bash -#. Untar the .nemo file to obtain the tokenizer in NeMo format (only for the 70B model): + mkdir -p /models/llama70b/ + python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /path/to/llama --output_path /models/llama70b/llama70b.nemo - .. code-block:: bash +Download and convert to .nemo format for the 13B model as well, which is needed for the Attribute Prediction Modelling step. - cd /models/llama70b - tar xvf llama70b.nemo . - rm llama70b.nemo +Untar the .nemo file to obtain the tokenizer in NeMo format (only for the 70B model): - mv _tokenizer.model tokenizer.model +.. code-block:: bash + + cd /models/llama70b + tar xvf llama70b.nemo . + rm llama70b.nemo + + mv _tokenizer.model tokenizer.model The prefix for the tokenizer would be different when extracted. Ensure that the correct tokenizer file is used when running the preceding command. -Download and Preprocess Data for Attribute Prediction Modeling -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 2: Download and Preprocess data for Attribute Prediction Modelling +####################################################################### -#. Download and convert both datasets into a common format: +First, download and convert both datasets into a common format. - .. code-block:: bash +.. code-block:: bash - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst - - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=data/helpsteer + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst + + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=data/helpsteer -#. Merge the two datasets for the train and val subset respectively: +Then, merge the two datasets for the train and val subset respectively. - .. code-block:: bash +.. 
code-block:: bash - cat data/oasst/train.jsonl data/helpsteer/train.jsonl | awk '{for(i=1;i<=4;i++) print}' > data/merge_train.jsonl + cat data/oasst/train.jsonl data/helpsteer/train.jsonl | awk '{for(i=1;i<=4;i++) print}' > data/merge_train.jsonl - cat data/oasst/val.jsonl data/helpsteer/val.jsonl > data/merge_val.jsonl + cat data/oasst/val.jsonl data/helpsteer/val.jsonl > data/merge_val.jsonl -#. Preprocess the data into regression reward model training format: +Finally, preprocess the data into the regression reward model training format. - .. code-block:: bash +.. code-block:: bash - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \ - --input-file=data/merge_train.jsonl \ - --output-file=data/merge_train_reg.jsonl + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \ + --input-file=data/merge_train.jsonl \ + --output-file=data/merge_train_reg.jsonl - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \ - --input-file=data/merge_val.jsonl \ - --output-file=data/merge_val_reg.jsonl + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \ + --input-file=data/merge_val.jsonl \ + --output-file=data/merge_val_reg.jsonl -Train the Regression Reward Model on OASST+HelpSteer Data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 3: Train the regression reward model on OASST+HelpSteer data +################################################################# For this tutorial, train the regression reward model for 800 steps. -.. note:: - Depending on the type of cluster you use, you may need to set up multi-node training in your cluster env. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html. +Note that, depending on the type of cluster you use, you may need to set up multi-node training in your cluster environment. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html. .. code-block:: bash @@ -142,42 +143,41 @@ For this tutorial, train the regression reward model for 800 steps. model.regression.num_attributes=9 -Generate Annotations -^^^^^^^^^^^^^^^^^^^^ +Step 4: Generate annotations +############################ +To generate annotations, run the following command in the background to launch an inference server: -#. To generate annotations, run the following command in the background to launch an inference server: - - .. code-block:: bash +.. code-block:: bash - python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \ - rm_model_file=/results/reward_model_13b/checkpoints/megatron_gpt.nemo \ - trainer.num_nodes=1 \ - trainer.devices=8 \ - ++model.tensor_model_parallel_size=4 \ - ++model.pipeline_model_parallel_size=1 \ - inference.micro_batch_size=2 \ - inference.port=1424 + python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \ + rm_model_file=/results/reward_model_13b/checkpoints/megatron_gpt.nemo \ + trainer.num_nodes=1 \ + trainer.devices=8 \ + ++model.tensor_model_parallel_size=4 \ + ++model.pipeline_model_parallel_size=1 \ + inference.micro_batch_size=2 \ + inference.port=1424 -#. Execute the following code: +Now execute the following: - .. code-block:: bash +.. 
code-block:: bash - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \ - --input-file=data/oasst/train.jsonl \ - --output-file=data/oasst/train_labeled.jsonl \ - --port=1424 + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \ + --input-file=data/oasst/train.jsonl \ + --output-file=data/oasst/train_labeled.jsonl \ + --port=1424 - python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \ - --input-file=data/oasst/val.jsonl \ - --output-file=data/oasst/val_labeled.jsonl \ - --port=1424 + python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \ + --input-file=data/oasst/val.jsonl \ + --output-file=data/oasst/val_labeled.jsonl \ + --port=1424 - cat data/oasst/train_labeled.jsonl data/oasst/train_labeled.jsonl > data/oasst/train_labeled_2ep.jsonl + cat data/oasst/train_labeled.jsonl data/oasst/train_labeled.jsonl > data/oasst/train_labeled_2ep.jsonl -Train the Attribute-Conditioned SFT Model -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Step 5: Train the Attribute-Conditioned SFT model +################################################# For the purposes of this tutorial, the Attribute-Conditioned SFT model is trained for 800 steps. @@ -233,111 +233,110 @@ For the purposes of this tutorial, the Attribute-Conditioned SFT model is traine -Run Inference -^^^^^^^^^^^^^ - -#. To start inference, run an inference server in the background using the following command: +Step 6: Inference +################## +To start inference, run an inference server in the background using the following command: - .. code-block:: bash +.. code-block:: bash - python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \ - gpt_model_file=/results/acsft_70b/checkpoints/megatron_gpt_sft.nemo \ - pipeline_model_parallel_split_rank=0 \ - server=True \ - tensor_model_parallel_size=8 \ - pipeline_model_parallel_size=1 \ - trainer.precision=bf16 \ - trainer.devices=8 \ - trainer.num_nodes=1 \ - web_server=False \ - port=1427 + python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \ + gpt_model_file=/results/acsft_70b/checkpoints/megatron_gpt_sft.nemo \ + pipeline_model_parallel_split_rank=0 \ + server=True \ + tensor_model_parallel_size=8 \ + pipeline_model_parallel_size=1 \ + trainer.precision=bf16 \ + trainer.devices=8 \ + trainer.num_nodes=1 \ + web_server=False \ + port=1427 - Please wait for the server to be ready before proceeeding. +Please wait for the server to be ready before proceeding. -#. Create Python helper functions: +Next, create Python helper functions: - .. code-block:: python +.. code-block:: python - import requests - from collections import OrderedDict - - def get_answer(question, max_tokens, values, eval_port=1427): - prompt = ( - "System\nA chat between a curious user and an artificial intelligence assistant. " - "The assistant gives helpful, detailed, and polite answers to the user's questions.\n" - "User\n{question}\nAssistant\n{values}\n" - ) - prompts = [prompt.format(question=question, values=values)] - data = { - "sentences": prompts, - "tokens_to_generate": max_tokens, - "top_k": 1, - "greedy": True, - "end_strings": [""], - } - url = f"http://localhost:{eval_port}/generate" - response = requests.put(url, json=data) - json_response = response.json() - response_sentence = json_response["sentences"][0][len(prompt):] - return response_sentence - - .. code-block:: python - - def encode_labels(labels): - return ",".join(f"{key}:{value}" for key, value in labels.items()) -#. 
Change the values below to steer the language model: - - .. code-block:: python - - values = OrderedDict( - [ - ("quality", 4), - ("toxicity", 0), - ("humor", 0), - ("creativity", 0), - ("helpfulness", 4), - ("correctness", 4), - ("coherence", 4), - ("complexity", 4), - ("verbosity", 4), - ] + import requests + from collections import OrderedDict + + def get_answer(question, max_tokens, values, eval_port=1427): + prompt = ( + "System\nA chat between a curious user and an artificial intelligence assistant. " + "The assistant gives helpful, detailed, and polite answers to the user's questions.\n" + "User\n{question}\nAssistant\n{values}\n" ) - values = encode_labels(values) - -#. Ask questions and generate responses: - - .. code-block:: python - - question = "Write a poem on NVIDIA in the style of Shakespeare" - print(get_answer(question, 512, values)) - - The response is shown below. - - .. code-block:: python - - """ - In days of yore, in tech's great hall, - A company arose, NVIDIA its call. - With graphics cards, it did astound, - And gaming world with awe did abound. - - But NVIDIA's reach far more than play, - Its GPUs now deep learning's sway. - With neural nets and data vast, - AI's rise, it did forecast. - - From self-driving cars to medical scans, - Its tech now touches all life's plans. - With each new day, its impact grows, - In science, research, and industry's prose. - - So here's to NVIDIA, whose name we praise, - For tech that captivates in countless ways. - With Shakespearean verse, we now impart, - Our thanks and admiration from the heart. - - """ + prompts = [prompt.format(question=question, values=values)] + data = { + "sentences": prompts, + "tokens_to_generate": max_tokens, + "top_k": 1, + "greedy": True, + "end_strings": [""], + } + url = f"http://localhost:{eval_port}/generate" + response = requests.put(url, json=data) + json_response = response.json() + response_sentence = json_response["sentences"][0][len(prompt):] + return response_sentence + +.. code-block:: python + + def encode_labels(labels): + return ",".join(f"{key}:{value}" for key, value in labels.items()) + +Next, change the values below to steer the language model: + +.. code-block:: python + + values = OrderedDict( + [ + ("quality", 4), + ("toxicity", 0), + ("humor", 0), + ("creativity", 0), + ("helpfulness", 4), + ("correctness", 4), + ("coherence", 4), + ("complexity", 4), + ("verbosity", 4), + ] + ) + values = encode_labels(values) + +Finally, ask questions and generate responses: + +.. code-block:: python + + question = "Write a poem on NVIDIA in the style of Shakespeare" + print(get_answer(question, 512, values)) + +The response is shown below. + +.. code-block:: python + + """ + In days of yore, in tech's great hall, + A company arose, NVIDIA its call. + With graphics cards, it did astound, + And gaming world with awe did abound. + - But NVIDIA's reach far more than play, + Its GPUs now deep learning's sway. + With neural nets and data vast, + AI's rise, it did forecast. + + From self-driving cars to medical scans, + Its tech now touches all life's plans. + With each new day, its impact grows, + In science, research, and industry's prose. + + So here's to NVIDIA, whose name we praise, + For tech that captivates in countless ways. + With Shakespearean verse, we now impart, + Our thanks and admiration from the heart. + + """ .. 
note:: diff --git a/docs/user-guide/steerlm2.rst b/docs/user-guide/steerlm2.rst index 366e0be03..b3802f45f 100644 --- a/docs/user-guide/steerlm2.rst +++ b/docs/user-guide/steerlm2.rst @@ -6,10 +6,11 @@ SteerLM 2.0: Iterative Training for Attribute-Conditioned Language Model Alignment @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ -**SteerLM 2.0** is a novel approach for aligning large language models (LLMs) to generate responses with desired attribute values, building upon the original `SteerLM `_ method [1]_ . While SteerLM conducts attribute-conditioned Supervised Fine-Tuning (SFT) to steer LLM outputs, SteerLM 2.0 introduces an iterative training procedure to explicitly enforce the generated responses to follow the desired attribute distribution. +**SteerLM 2.0** is a novel approach for aligning large language models (LLMs) to generate responses with desired attribute values, building upon the original `SteerLM `_ method [1]_ . While SteerLM conducts attribute-conditioned supervised fine-tuning to steer LLM outputs, SteerLM 2.0 introduces an iterative training procedure to explicitly enforce the generated responses to follow the desired attribute distribution. + Overview -######## +########## The goal of SteerLM 2.0 is to train a model :math:`Q_\theta(y|a, x)` that can generate responses :math:`y` conditioned on a prompt :math:`x` and desired attributes :math:`a`, while approximating the optimal conditional distribution :math:`P(y|a, x)` derived from an attribute prediction model :math:`P(a|x, y)` and an unconditional response model :math:`P(y|x)`. SteerLM 2.0 accomplishes this by minimizing the Kullback-Leibler (KL) divergence between :math:`P(y|a, x)` and :math:`Q_\theta(y|a, x)`: @@ -46,12 +47,12 @@ By iteratively training on this loss, SteerLM 2.0 can learn to generate response Train a SteerLM 2.0 Model ########################### -Prepare the Training Dataset ----------------------------- +Preparing the Training Dataset +------------------------------ SteerLM 2.0 requires a specific data format to train the model effectively. According to the SteerLM 2.0 method, the following components are needed: -- A SFT model :math:`P(y|x)` that generates responses :math:`y` given a prompt :math:`x` +- A supervised fine-tuning (SFT) model :math:`P(y|x)` that generates responses :math:`y` given a prompt :math:`x` - An original SteerLM model :math:`Q'(y|a, x)` that generates responses :math:`y` conditioned on attributes :math:`a` and prompt :math:`x` The SteerLM 2.0 model :math:`Q_\theta(y|a, x)` is initialized with the weights from :math:`Q'(y|a, x)` and optimized to approximate the optimal conditional distribution :math:`P(y|a, x)` derived from the attribute prediction model :math:`P(a|x, y)` and the unconditional response model :math:`P(y|x)`. @@ -105,7 +106,7 @@ These values are provided as log(P(a|x,y)), log(P(y|x)), and log(Q(y|a,x)), resp Training Example ------------------ -By organizing the data in this format, the SteerLM 2.0 model can be effectively trained to generate responses that conform to the desired attribute values while approximating the optimal conditional distribution :math:`P(y|a, x)`. The following is an example of launching the training of SteerLM 2.0: +By organizing the data in this format, the SteerLM 2.0 model can be effectively trained to generate responses that conform to the desired attribute values while approximating the optimal conditional distribution :math:`P(y|a, x)`. 
The following is an example of launching the training of SteerLM 2.0: .. code-block:: bash @@ -119,7 +120,7 @@ By organizing the data in this format, the SteerLM 2.0 model can be effectively trainer.sft.val_check_interval=800 \ trainer.sft.save_interval=800 \ model.megatron_amp_O2=True \ - model.restore_from_path=/path/to/steerlm1/model \ + model.restore_from_path=/models/llama70b \ model.tensor_model_parallel_size=8 \ model.pipeline_model_parallel_size=2 \ model.optim.lr=6e-6 \ @@ -158,8 +159,6 @@ By organizing the data in this format, the SteerLM 2.0 model can be effectively exp_manager.explicit_log_dir=/results/acsft_70b \ exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True -``/path/to/steerlm1/model`` is the path to the initial SteerLM model. For details on training the initial SteerLM model, refer to the :ref:`SteerLM documentation `. - Inference ------------------
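Inference with the trained SteerLM 2.0 checkpoint follows the same pattern as the attribute-conditioned SFT model in the SteerLM tutorial above. As a minimal sketch, assuming the checkpoint is written under the ``exp_manager.explicit_log_dir`` used in the training example and that the same ``megatron_gpt_eval.py`` serving setup applies, you can launch an inference server and then steer the model with the ``get_answer``/``encode_labels`` helpers shown earlier:

.. code-block:: bash

   # Illustrative serving command mirroring the SteerLM tutorial above;
   # adjust the checkpoint path and parallelism settings to your setup.
   python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
       gpt_model_file=/results/acsft_70b/checkpoints/megatron_gpt_sft.nemo \
       pipeline_model_parallel_split_rank=0 \
       server=True \
       tensor_model_parallel_size=8 \
       pipeline_model_parallel_size=1 \
       trainer.precision=bf16 \
       trainer.devices=8 \
       trainer.num_nodes=1 \
       web_server=False \
       port=1427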