Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2. #3728

Open

wants to merge 49 commits into base: master

Commits (49)
- `af00bba` WIP: deepspeed stage 2 (tgaddair, Sep 25, 2023)
- `70bde7d` Place on current device (tgaddair, Sep 25, 2023)
- `4d89ddb` Fixed device placement (tgaddair, Sep 25, 2023)
- `7d6e708` Merge branch 'master' into ds-stage2 (arnavgarg1, Sep 27, 2023)
- `fcd7d3b` Fix issue with distributed eval metric_fn placement (arnavgarg1, Sep 27, 2023)
- `4b43aef` Add workaround for checkpoint saving based on stage (arnavgarg1, Sep 27, 2023)
- `17a71f0` Gate logging statements with is_coordinator() so they only show up fo… (arnavgarg1, Sep 27, 2023)
- `7d65619` Gate more logging with coordinator barrier (arnavgarg1, Sep 29, 2023)
- `a30ba99` Latest push (arnavgarg1, Sep 29, 2023)
- `c8c273b` Working e2e, but not everything is correct (arnavgarg1, Oct 3, 2023)
- `1aa8efe` Clarification comment (arnavgarg1, Oct 3, 2023)
- `8c525e0` Clean up (arnavgarg1, Oct 6, 2023)
- `b7d68ae` Docstring (arnavgarg1, Oct 6, 2023)
- `78da3d9` resolve merge conflictts (arnavgarg1, Oct 6, 2023)
- `bd3341f` More cleanup (arnavgarg1, Oct 6, 2023)
- `ada045b` Set default optimization stage to 3 (arnavgarg1, Oct 6, 2023)
- `c997d33` Merge branch 'master' into ds-stage2 (arnavgarg1, Oct 6, 2023)
- `a3546f5` Merge branch 'master' into ds-stage2 (arnavgarg1, Oct 10, 2023)
- `5284420` Compatibility, but most of this is not needed with some re-architecting (arnavgarg1, Oct 10, 2023)
- `d0c174e` Add filelock around model from_pretrained call (arnavgarg1, Oct 10, 2023)
- `e68a02e` Add dynamic device_map setting based on backend and zero stage (arnavgarg1, Oct 10, 2023)
- `fec3ef4` Comments (arnavgarg1, Oct 11, 2023)
- `65855ec` Merge branch 'master' into ds-stage2 (arnavgarg1, Oct 13, 2023)
- `48f353d` Conditional model loading in LLM base class (arnavgarg1, Oct 13, 2023)
- `9ba3347` Add TODO to fix issue (arnavgarg1, Oct 13, 2023)
- `b4fdc0e` Revert to what was working (arnavgarg1, Oct 13, 2023)
- `eb33afd` Add utility functions to simplify (arnavgarg1, Oct 13, 2023)
- `b3c2496` Working e2s DS stage 2 (arnavgarg1, Oct 13, 2023)
- `b14bbeb` Simplify (arnavgarg1, Oct 13, 2023)
- `f519672` Minor modification for ds stage 3 compatibility (arnavgarg1, Oct 13, 2023)
- `b34ef05` Merge branch 'master' into ds-stage2 (arnavgarg1, Oct 13, 2023)
- `799fcd6` [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Oct 13, 2023)
- `ecf2113` Working with DS stage 3 as wellgit add ludwig/ (arnavgarg1, Oct 16, 2023)
- `6f70747` Cleanup (arnavgarg1, Oct 16, 2023)
- `2eea6a2` Cleanup (arnavgarg1, Oct 16, 2023)
- `d9d49ce` More cleanup (arnavgarg1, Oct 16, 2023)
- `cd73bd1` Refactor (arnavgarg1, Oct 16, 2023)
- `483d752` Add Data parallel QloRA example training script (arnavgarg1, Oct 16, 2023)
- `f0099b4` Add basic unit tests (arnavgarg1, Oct 16, 2023)
- `a379712` Log artifact dir (arnavgarg1, Oct 16, 2023)
- `855d2c1` Fix example script (arnavgarg1, Oct 16, 2023)
- `3a11709` Comments (arnavgarg1, Oct 16, 2023)
- `e8cdf27` Resolve merge conflicts (arnavgarg1, Oct 17, 2023)
- `2cf7f4b` Resolve comments (arnavgarg1, Oct 17, 2023)
- `832eb93` Comment with relevant doc snippets (arnavgarg1, Oct 17, 2023)
- `fbba8cb` Add config validation check and more tests for quantization and backe… (arnavgarg1, Oct 18, 2023)
- `b5b3dd7` Address comments (arnavgarg1, Oct 19, 2023)
- `2e06e5f` Merge branch 'master' into ds-stage2 (arnavgarg1, Oct 19, 2023)
- `46fae60` Address nit (arnavgarg1, Oct 19, 2023)
59 changes: 59 additions & 0 deletions examples/llm_qlora_data_parallel/README.md
@@ -0,0 +1,59 @@
# Data-Parallel QLoRA Fine-Tuning

If you have a single-node, multi-GPU setup and a large dataset on which you would like to fine-tune a model using QLoRA, you can use DeepSpeed ZeRO Stage 0, 1, or 2.

## DeepSpeed Background

As a refresher, here is what each DeepSpeed Zero stage corresponds to:

- **Stage 0**: Disabled, i.e., no partitioning of optimizer state, gradients, or model parameters. You can still perform optimizer and parameter offloading, as well as training with bf16 or fp16.
- **Stage 1**: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
- **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
- **Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.

_NOTE: Data-parallel QLoRA training only works with DeepSpeed stages \<= 2. As of DeepSpeed 0.10.3, DeepSpeed is not compatible with partitioning/sharding of quantized weights when the weights are a mixture of dtypes_. See:

- https://github.com/microsoft/DeepSpeed/issues/4295
- https://github.com/microsoft/DeepSpeed/issues/3620

In particular, this comment summarizes the issue well:

> Some code for ZeRO3 assumes that all parameters in a model has the same dtype. This model has uint8 and float32 parameters and it throws the error.
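
To make the dtype issue concrete, here is a minimal sketch (not part of this PR) that loads a small model with 4-bit quantization and counts its parameter dtypes. The model name, and the use of `transformers`/`bitsandbytes` directly rather than through Ludwig, are illustrative assumptions.

```python
# Minimal sketch: show that a 4-bit-quantized model holds a mixture of parameter
# dtypes, which breaks ZeRO-3's uniform-dtype assumption.
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed and a CUDA GPU is available.
from collections import Counter

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # small model used purely for illustration
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map={"": 0},
)

# Quantized linear weights are stored as torch.uint8, while embeddings, norms, etc.
# remain in a floating-point dtype, so the model contains mixed dtypes.
print(Counter(str(p.dtype) for p in model.parameters()))
```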

## Example Config

The example `train.py` uses DeepSpeed Stage 2 with the Ray backend, configured as follows, to fine-tune a model on a natural-language-to-code generation task via instruction fine-tuning.

```yaml
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2
```

In most cases, Stage 2 lets you train large models in a distributed fashion across multiple GPUs. If you want to use Stage 0 or 1 instead, simply replace `stage: 2` with the desired ZeRO optimization stage.
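
For example, a hypothetical Stage 1 variant of the same backend block is sketched below. The `offload_optimizer` section is an illustrative assumption based on DeepSpeed's `zero_optimization` options (it is not used in the example script) and assumes the `zero_optimization` block is forwarded to DeepSpeed unchanged.

```yaml
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 1
        # Assumption: offload optimizer state to CPU memory (DeepSpeed ZeRO-Offload);
        # not part of the original example.
        offload_optimizer:
          device: cpu
```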

## DeepSpeed Zero Stage Benefits

### Benefits of DeepSpeed Stage 0

- **Ease of Use**: Stage 0 is relatively easy to set up and use, making it a good starting point for users looking for memory-efficient training without the complexity of more advanced optimization techniques.
- **Gradient Accumulation**: Stage 0 enables gradient accumulation, which is beneficial for simulating larger batch sizes even on hardware with memory constraints. This can lead to more stable model training and potentially faster convergence.
- **Mixed Precision Training**: It supports mixed-precision training, which uses lower-precision data types (e.g., float16) to reduce memory usage while maintaining training stability (a minimal sketch follows this list).
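
Gradient accumulation is configured on the Ludwig trainer rather than on DeepSpeed itself; the sketch below mirrors the trainer settings used in the example script later in this PR. The `bf16` block is an assumption about how a mixed-precision setting could be passed through the `deepspeed` strategy (it is not part of the original example).

```yaml
trainer:
  type: finetune
  batch_size: 1
  gradient_accumulation_steps: 4  # effective batch size of 4 per device

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      # Assumption: bf16 mixed precision, mirroring the raw DeepSpeed JSON config.
      bf16:
        enabled: true
      zero_optimization:
        stage: 0
```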

### Benefits of DeepSpeed Stage 1

- **Optimizer State Partitioning**: Stage 1 is primarily focused on partitioning the optimizer state, allowing you to train very large models that wouldn't fit within a single GPU's memory.
- **Memory Efficiency**: It efficiently manages memory by dividing the optimizer state into segments distributed across multiple GPUs. This makes training larger models feasible.
- **Stepping Stone to Larger Models**: Stage 1 is especially valuable when the optimizer state no longer fits alongside the model on a single GPU, making it a natural step before scaling up to the more advanced stages for distributed training.
- **Limited Configuration Complexity**: It introduces memory efficiency while maintaining a relatively simple configuration setup compared to the more advanced stages like Stage 2 and Stage 3.

### Benefits of DeepSpeed Stage 2

- **Training Extremely Large Models**: ZeRO Stage 2 partitions both the gradients and the optimizer state to reduce memory requirements significantly. By contrast, Stage 0 and Stage 1 do not have the same level of memory optimization to handle models of such magnitude.
- **Advanced Distributed Training**: ZeRO Stage 2 optimizes communication, gradient aggregation, and synchronization across GPUs and nodes, making it well suited to training large models efficiently in a distributed environment. This capability goes beyond Stage 0 and is more sophisticated than Stage 1, helping achieve faster training times and handle larger workloads.
144 changes: 144 additions & 0 deletions examples/llm_qlora_data_parallel/train.py
@@ -0,0 +1,144 @@
import logging
import os

import numpy as np
import pandas as pd
import yaml

from ludwig.api import LudwigModel
from ludwig.datasets import code_alpaca

np.random.seed(123)


# Llama-2-7b-hf requires HUGGING_FACE_HUB_TOKEN to be set as an environment variable
# You can get a token at https://huggingface.co/settings/tokens
if "HUGGING_FACE_HUB_TOKEN" not in os.environ:
    raise ValueError(
        "Please set your Hugging Face Hub token as an environment variable using `export "
        "HUGGING_FACE_HUB_TOKEN=your_token`. You can get a token at https://huggingface.co/settings/tokens"
    )

fine_tuning_config = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: >-
    Below is an instruction that describes a task, paired with an input
    that provides further context. Write a response that appropriately
    completes the request.

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:

generation:
  temperature: 0.1
  max_new_tokens: 256

adapter:
  type: lora

quantization:
  bits: 4

preprocessing:
  split:
    type: random
    probabilities:
      - 0.9
      - 0.05
      - 0.05
  global_max_sequence_length: 512
  sample_size: 1000

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2

trainer:
  type: finetune
  epochs: 3
  batch_size: 1
  eval_batch_size: 1
  enable_gradient_checkpointing: true
  gradient_accumulation_steps: 4
  learning_rate: 0.0001
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
"""
)

df = code_alpaca.load(split=False)
model = LudwigModel(config=fine_tuning_config, logging_level=logging.INFO)

(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects of pre-processed training data
    output_directory,  # location of training results stored on disk
) = model.train(
    dataset=df,
    experiment_name="code_alpaca_instruct",
    model_name="llama2_7b",
)

# List contents of output directory
print("Contents of output directory:", output_directory)
for item in os.listdir(output_directory):
print("\t", item)

# Run Inference
print("Predict")
prediction_df = pd.DataFrame(
    [
        {
            "instruction": "Create an array of length 5 which contains all even numbers between 1 and 10.",
            "input": "",
        },
        {
            "instruction": "Create an array of length 15 containing numbers divisible by 3 up to 45.",
            "input": "",
        },
        {
            "instruction": "Create a nested loop to print every combination of numbers between 0-9",
            "input": "",
        },
        {
            "instruction": "Generate a function that computes the sum of the numbers in a given list",
            "input": "",
        },
        {
            "instruction": "Create a class to store student names, ages and grades.",
            "input": "",
        },
        {
            "instruction": "Print out the values in the following dictionary.",
            "input": "my_dict = {\n 'name': 'John Doe',\n 'age': 32,\n 'city': 'New York'\n}",
        },
    ]
)
preds, _ = model.predict(dataset=prediction_df)
preds = preds.compute()
for input_with_prediction in zip(prediction_df["instruction"], prediction_df["input"], preds["output_response"]):
print(f"Instruction: {input_with_prediction[0]}")
print(f"Input: {input_with_prediction[1]}")
print(f"Generated Output: {input_with_prediction[2][0]}")
print("\n\n")