[distro] feat: add docker support (#41)
* [distro] feat: add docker support

* update docker tag

* update description
eric-haibin-lin authored Dec 9, 2024
1 parent c592a8b commit 50ac725
Showing 3 changed files with 137 additions and 34 deletions.
99 changes: 65 additions & 34 deletions README.md
@@ -1,7 +1,3 @@
<div align=center>
<img src="docs/_static/logo.png" width = "20%" height = "20%" />
</div>

<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>

veRL (HybridFlow) is a flexible, efficient, and production-ready RL(HF) training framework designed for large language models (LLMs). veRL is the open-source implementation of the [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.
@@ -29,66 +25,100 @@ veRL is fast with:
<!-- <a href=""><b>Slides</b></a> | -->
</p>

## Installation Guide

Below are the steps to install veRL in your environment.

### Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1

veRL supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** for training.
- **vLLM** for rollout generation.

**Training backends**

We recommend using the **FSDP** backend to investigate, research, and prototype with different models, datasets, and RL algorithms. The guide for using the FSDP backend can be found in [PyTorch FSDP Backend](https://verl.readthedocs.io/en/latest/workers/fsdp_workers.html).

For users who need better scalability, we recommend the **Megatron-LM** backend. We currently support Megatron-LM@core_v0.4.0 and patch some of its internal issues; the additional installation steps are covered below. The guide for using the Megatron-LM backend can be found in [Megatron-LM Backend](https://verl.readthedocs.io/en/latest/workers/megatron_workers.html).

### Installation Options

#### 1. From Docker Image

We provide pre-built Docker images for quick setup.

Image and tag: `verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`

1. Launch the desired Docker image:

```bash
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
```
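
For instance, with the image tag listed above and the current directory mounted into the container (a sketch; the `/workspace` mount path is an assumption, adjust it to your setup):

```bash
# illustrative launch: mount the current directory and drop into a shell
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
    -v "$PWD":/workspace \
    verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
```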

2. Inside the container, install veRL:

```bash
# install the nightly version
git clone https://github.com/volcengine/verl && cd verl && pip3 install -e .
# or install from pypi via `pip3 install verl`
```

3. Set up Megatron (optional)

To enable training with Megatron, the Megatron-LM code must be added to `PYTHONPATH`:

```bash
cd ..
git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
cp verl/patches/megatron_v4.patch Megatron-LM/
cd Megatron-LM && git apply megatron_v4.patch
pip3 install -e .
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

You can also get the Megatron code with verl's patch already applied via
```bash
git clone -b core_v0.4.0_verl https://github.com/eric-haibin-lin/Megatron-LM
```
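
As a quick sanity check (a sketch, assuming the `Megatron-LM` checkout above is on `PYTHONPATH`), confirm that Megatron resolves from the expected location:

```bash
# should print the path to the Megatron-LM checkout cloned above
python3 -c "import megatron.core; print(megatron.core.__file__)"
```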

#### 2. From Custom Environments

<details><summary>If you prefer setting up veRL in your custom environment, expand this section and follow the steps below.</summary>

Using **conda** is recommended for managing dependencies.

1. Create a conda environment:

```bash
conda create -n verl python==3.9
conda activate verl
```

2. Install common dependencies (required for all backends)

```bash
# install torch [or you can skip this step and let vllm install the correct version for you]
pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# install vllm
pip3 install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip3 install ray

# flash attention 2
pip3 install flash-attn --no-build-isolation
```
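
As a quick check (a minimal sketch) that the common dependencies are consistent, verify that torch sees the GPU and that the pinned vllm version was installed:

```bash
# verify the torch build, CUDA visibility, and the installed vllm version
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import vllm; print(vllm.__version__)"
```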

3. Install veRL
```bash
# install the nightly version
git clone https://github.com/volcengine/verl && cd verl && pip3 install -e .
# or install from pypi via `pip3 install verl`
```
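
To confirm the install resolves (a sketch):

```bash
# should print the location of the verl package installed above
python3 -c "import verl; print(verl.__file__)"
```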

4. Set up Megatron (optional)

@@ -103,13 +133,14 @@

```bash
# FOR Megatron-LM Backend
pip3 install git+https://github.com/NVIDIA/[email protected]
# megatron core v0.4.0
cd ..
git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
cp verl/patches/megatron_v4.patch Megatron-LM/
cd Megatron-LM && git apply megatron_v4.patch
pip3 install -e .
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

</details>

## Getting Started
Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to learn more.
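
As an illustration of a first run (the script path below is an assumption based on the repository's `examples/` folder, not a confirmed entry point; consult the documentation for the actual quickstart):

```bash
# hypothetical example run; check the docs for the exact script and config
bash examples/ppo_trainer/run_deepseek7b_llm.sh
```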

31 changes: 31 additions & 0 deletions docker/Dockerfile.ngc.vllm
@@ -0,0 +1,31 @@
FROM nvcr.io/nvidia/pytorch:24.05-py3

# uninstall nv-pytorch fork
RUN pip3 uninstall pytorch-quantization \
pytorch-triton \
torch \
torch-tensorrt \
torchvision \
xgboost transformer_engine flash_attn \
apex megatron-core -y

RUN pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# make sure torch version is kept
RUN pip3 install --no-cache-dir \
"torch==2.4.0" \
accelerate \
codetiming \
datasets \
dill \
hydra-core \
numpy \
pybind11 \
tensordict \
"transformers<=4.46.0"

# ray is installed via vllm
RUN pip3 install --no-cache-dir vllm==0.6.3

# we choose flash-attn v2.7.0 (or v2.7.2), which ships pre-built wheels
RUN pip3 install --no-cache-dir --no-build-isolation flash-attn==2.7.0.post2
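
To build an image from this Dockerfile (a sketch; the tag `verl:ngc-vllm0.6.3` is an arbitrary local name):

```bash
# build from the repository root; the tag is an arbitrary local name
docker build -f docker/Dockerfile.ngc.vllm -t verl:ngc-vllm0.6.3 .
```
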
41 changes: 41 additions & 0 deletions docker/Dockerfile.vemlp.vllm.te
@@ -0,0 +1,41 @@
# docker buildx build --platform linux/x86_64 -t "verlai/verl:$TAG" -f docker/$FILE .

# the image on docker.io is an alias for the one in the veturbo registry
# FROM vemlp-cn-beijing.cr.volces.com/veturbo/pytorch:2.4-cu124
FROM docker.io/haibinlin/verl:v0.0.5-th2.4.0-cu124-base

# only configure the pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
# unset for now
RUN pip3 config unset global.index-url

# transformers 4.47.0 contains the following bug:
# AttributeError: 'Gemma2Attention' object has no attribute '_flash_attn_uses_top_left_mask'
RUN pip3 install --no-cache-dir \
torch==2.4.0 \
accelerate \
codetiming \
dill \
hydra-core \
numpy \
pybind11 \
tensordict \
"transformers <= 4.46.0"

RUN pip3 install --no-cache-dir flash-attn==2.7.0.post2 --no-build-isolation

# vllm depends on ray, and veRL does not support ray > 2.37
RUN pip3 install --no-cache-dir vllm==0.6.3 ray==2.10

# install apex
RUN MAX_JOBS=4 pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" \
git+https://github.com/NVIDIA/apex

# install Transformer Engine
# - flash-attn pinned to 2.5.3 by TransformerEngine; switch to eric-haibin-lin/[email protected] to relax the version requirement
# - install with: MAX_JOBS=1 NINJA_FLAGS="-j1" TE_BUILD_WITH_NINJA=0 to avoid OOM
# - cudnn is required by TransformerEngine
# RUN CUDNN_PATH=/opt/conda/lib/python3.11/site-packages/nvidia/cudnn \
#   pip3 install git+https://github.com/eric-haibin-lin/[email protected]
RUN MAX_JOBS=1 NINJA_FLAGS="-j1" pip3 install flash-attn==2.5.3 --no-cache-dir --no-build-isolation
RUN MAX_JOBS=1 NINJA_FLAGS="-j1" pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.7
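
Instantiating the build command from the header comment with concrete values (the `TAG` value below mirrors the image tag in the README and is illustrative):

```bash
# concrete instantiation of the buildx command in the header comment
TAG=vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
FILE=Dockerfile.vemlp.vllm.te
docker buildx build --platform linux/x86_64 -t "verlai/verl:$TAG" -f docker/$FILE .
```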
