[distro] feat: add docker support (#41)
* [distro] feat: add docker support

* update docker tag

* update description
eric-haibin-lin authored Dec 9, 2024
1 parent c592a8b commit 50ac725
Showing 3 changed files with 137 additions and 34 deletions.
99 changes: 65 additions & 34 deletions README.md
@@ -1,7 +1,3 @@
<div align=center>
<img src="docs/_static/logo.png" width = "20%" height = "20%" />
</div>

<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>

veRL (HybridFlow) is a flexible, efficient, and production-ready RL(HF) training framework designed for large language models (LLMs). veRL is the open-source implementation of the [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.
@@ -29,66 +25,100 @@ veRL is fast with:
<!-- <a href=""><b>Slides</b></a> | -->
</p>

## Installation Guide

Below are the steps to install veRL in your environment.

### Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1

veRL supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** for training.
- **vLLM** for rollout generation.

**Training backends**

We recommend using the **FSDP** backend to investigate, research, and prototype with different models, datasets, and RL algorithms. The guide for using the FSDP backend can be found in [PyTorch FSDP Backend](https://verl.readthedocs.io/en/latest/workers/fsdp_workers.html).

For users who need better scalability, we recommend the **Megatron-LM** backend. We currently support Megatron-LM@core_v0.4.0 and patch some of its internal issues; the additional installation steps are covered below. The guide for using the Megatron-LM backend can be found in [Megatron-LM Backend](https://verl.readthedocs.io/en/latest/workers/megatron_workers.html).

### Installation Options

#### 1. From Docker Image

We provide pre-built Docker images for quick setup.

Image and tag: `verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`

1. Launch the desired Docker image:

```bash
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
```
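
For instance, with the image tag listed above and the current directory mounted into the container (a sketch; the `/workspace` mount path is an assumption, adjust it to your setup):

```bash
# illustrative launch: mount the current directory and drop into a shell
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
    -v "$PWD":/workspace \
    verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
```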

2. Inside the container, install veRL:

```bash
# install the nightly version
git clone https://github.com/volcengine/verl && cd verl && pip3 install -e .
# or install from pypi via `pip3 install verl`
```

3. Set up Megatron (optional)

To enable training with Megatron, the Megatron-LM code must be added to `PYTHONPATH`:

```bash
cd ..
git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
cp verl/patches/megatron_v4.patch Megatron-LM/
cd Megatron-LM && git apply megatron_v4.patch
pip3 install -e .
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

You can also get the Megatron code with verl's patch already applied via
```bash
git clone -b core_v0.4.0_verl https://github.com/eric-haibin-lin/Megatron-LM
```
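
As a quick sanity check (a sketch, assuming the `Megatron-LM` checkout above is on `PYTHONPATH`), confirm that Megatron resolves from the expected location:

```bash
# should print the path to the Megatron-LM checkout cloned above
python3 -c "import megatron.core; print(megatron.core.__file__)"
```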

#### 2. From Custom Environments

<details><summary>If you prefer setting up veRL in your custom environment, expand this section and follow the steps below.</summary>

Using **conda** is recommended for managing dependencies.

1. Create a conda environment:

```bash
conda create -n verl python==3.9
conda activate verl
```

2. Install common dependencies (required for all backends)

```bash
# install torch [or you can skip this step and let vllm install the correct version for you]
pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# install vllm
pip3 install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip3 install ray

# flash attention 2
pip3 install flash-attn --no-build-isolation
```
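
As a quick check (a minimal sketch) that the common dependencies are consistent, verify that torch sees the GPU and that the pinned vllm version was installed:

```bash
# verify the torch build, CUDA visibility, and the installed vllm version
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import vllm; print(vllm.__version__)"
```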

3. Install veRL
```bash
# install the nightly version
git clone https://github.com/volcengine/verl && cd verl && pip3 install -e .
# or install from pypi via `pip3 install verl`
```
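
To confirm the install resolves (a sketch):

```bash
# should print the location of the verl package installed above
python3 -c "import verl; print(verl.__file__)"
```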

4. Set up Megatron (optional)

@@ -103,13 +133,14 @@

```bash
# FOR Megatron-LM Backend
pip3 install git+https://github.com/NVIDIA/[email protected]
# megatron core v0.4.0
cd ..
git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
cp verl/patches/megatron_v4.patch Megatron-LM/
cd Megatron-LM && git apply megatron_v4.patch
pip3 install -e .
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

</details>

## Getting Started
Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to learn more.
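
As an illustration of a first run (the script path below is an assumption based on the repository's `examples/` folder, not a confirmed entry point; consult the documentation for the actual quickstart):

```bash
# hypothetical example run; check the docs for the exact script and config
bash examples/ppo_trainer/run_deepseek7b_llm.sh
```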

31 changes: 31 additions & 0 deletions docker/Dockerfile.ngc.vllm
@@ -0,0 +1,31 @@
FROM nvcr.io/nvidia/pytorch:24.05-py3

# uninstall nv-pytorch fork
RUN pip3 uninstall pytorch-quantization \
pytorch-triton \
torch \
torch-tensorrt \
torchvision \
xgboost transformer_engine flash_attn \
apex megatron-core -y

RUN pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# make sure torch version is kept
RUN pip3 install --no-cache-dir \
"torch==2.4.0" \
accelerate \
codetiming \
datasets \
dill \
hydra-core \
numpy \
pybind11 \
tensordict \
"transformers<=4.46.0"

# ray is installed via vllm
RUN pip3 install --no-cache-dir vllm==0.6.3

# we choose flash-attn v2.7.0 (or v2.7.2), which ships pre-built wheels
RUN pip3 install --no-cache-dir --no-build-isolation flash-attn==2.7.0.post2
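
To build an image from this Dockerfile (a sketch; the tag `verl:ngc-vllm0.6.3` is an arbitrary local name):

```bash
# build from the repository root; the tag is an arbitrary local name
docker build -f docker/Dockerfile.ngc.vllm -t verl:ngc-vllm0.6.3 .
```
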
41 changes: 41 additions & 0 deletions docker/Dockerfile.vemlp.vllm.te
@@ -0,0 +1,41 @@
# docker buildx build --platform linux/x86_64 -t "verlai/verl:$TAG" -f docker/$FILE .

# the image on docker.io is an alias for the one in the veturbo registry
# FROM vemlp-cn-beijing.cr.volces.com/veturbo/pytorch:2.4-cu124
FROM docker.io/haibinlin/verl:v0.0.5-th2.4.0-cu124-base

# only configure the pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
# unset for now
RUN pip3 config unset global.index-url

# transformers 4.47.0 contains the following bug:
# AttributeError: 'Gemma2Attention' object has no attribute '_flash_attn_uses_top_left_mask'
RUN pip3 install --no-cache-dir \
torch==2.4.0 \
accelerate \
codetiming \
dill \
hydra-core \
numpy \
pybind11 \
tensordict \
"transformers <= 4.46.0"

RUN pip3 install --no-cache-dir flash-attn==2.7.0.post2 --no-build-isolation

# vllm depends on ray, and veRL does not support ray > 2.37
RUN pip3 install --no-cache-dir vllm==0.6.3 ray==2.10

# install apex
RUN MAX_JOBS=4 pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" \
git+https://github.com/NVIDIA/apex

# install Transformer Engine
# - flash-attn pinned to 2.5.3 by TransformerEngine; switch to eric-haibin-lin/[email protected] to relax the version requirement
# - install with: MAX_JOBS=1 NINJA_FLAGS="-j1" TE_BUILD_WITH_NINJA=0 to avoid OOM
# - cudnn is required by TransformerEngine
# RUN CUDNN_PATH=/opt/conda/lib/python3.11/site-packages/nvidia/cudnn \
#   pip3 install git+https://github.com/eric-haibin-lin/[email protected]
RUN MAX_JOBS=1 NINJA_FLAGS="-j1" pip3 install flash-attn==2.5.3 --no-cache-dir --no-build-isolation
RUN MAX_JOBS=1 NINJA_FLAGS="-j1" pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.7
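
Instantiating the build command from the header comment with concrete values (the `TAG` value below mirrors the image tag in the README and is illustrative):

```bash
# concrete instantiation of the buildx command in the header comment
TAG=vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
FILE=Dockerfile.vemlp.vllm.te
docker buildx build --platform linux/x86_64 -t "verlai/verl:$TAG" -f docker/$FILE .
```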
