[ci] feat: add more CI workflow #38

Merged: 73 commits (Jan 9, 2025)
Commits (73)
f4ee653
[ci] upload several tests
PeterSH6 Dec 6, 2024
008a73c
[ci] add sanity and tensordict utility workflow
PeterSH6 Dec 6, 2024
7aa51c0
[ci] fix workflow
PeterSH6 Dec 6, 2024
3c729fd
try fix import ci
PeterSH6 Dec 6, 2024
7425b36
[dataproto] update repeat and unpad/pad
PeterSH6 Dec 7, 2024
23cffb4
fix rollout test to 2GPU
PeterSH6 Dec 7, 2024
7865830
merge master
PeterSH6 Jan 6, 2025
3357d99
add a fsdp vllm hybridengine script, which can be launched by torchrun
PeterSH6 Jan 6, 2025
8734166
fix import test
PeterSH6 Jan 6, 2025
de3b72e
merge master
PeterSH6 Jan 6, 2025
675bff5
update requirement.txt
PeterSH6 Jan 6, 2025
cc99c80
draft vllm fsdp test
PeterSH6 Jan 6, 2025
9756354
update label
PeterSH6 Jan 6, 2025
dc205f2
fix
PeterSH6 Jan 6, 2025
79ae3b0
upload conda
PeterSH6 Jan 7, 2025
2f2bf3d
test conda
PeterSH6 Jan 7, 2025
9bb2cbe
test ci
PeterSH6 Jan 7, 2025
6efecc4
use docker
PeterSH6 Jan 7, 2025
a4bd1cd
test ci
PeterSH6 Jan 7, 2025
1b713fb
test ci
PeterSH6 Jan 7, 2025
5e173a4
test ci
PeterSH6 Jan 7, 2025
5f541cb
update ci
PeterSH6 Jan 7, 2025
3e63cdd
test ci
PeterSH6 Jan 7, 2025
e762e98
fix model loader
PeterSH6 Jan 7, 2025
35f3b74
fix model loader
PeterSH6 Jan 7, 2025
3c2f36e
test ci
PeterSH6 Jan 7, 2025
9f8e7db
test
PeterSH6 Jan 7, 2025
d47180a
upload e2e digit completion test
PeterSH6 Jan 8, 2025
902bbde
update running script for e2e test
PeterSH6 Jan 8, 2025
bfc8adc
update test config
PeterSH6 Jan 8, 2025
870f938
fix path
PeterSH6 Jan 8, 2025
683fbdc
test
PeterSH6 Jan 8, 2025
4332f57
fix import to register autotokenizer
PeterSH6 Jan 8, 2025
e1a6a5b
fix tokenizer
PeterSH6 Jan 8, 2025
0e079e6
fix create dataset
PeterSH6 Jan 8, 2025
175845a
fix
PeterSH6 Jan 8, 2025
1f915aa
fix reward model validate
PeterSH6 Jan 8, 2025
331bf9c
fix reward module of digit_completion
PeterSH6 Jan 8, 2025
8024c87
fix reward module of digit_completion
PeterSH6 Jan 8, 2025
ac8f4ee
fix reward module of digit_completion
PeterSH6 Jan 8, 2025
c66846f
fix reward module of digit_completion
PeterSH6 Jan 8, 2025
b7178c3
fix reward module of digit_completion
PeterSH6 Jan 8, 2025
42d4d79
can run but seems to have some test issue
PeterSH6 Jan 8, 2025
db6e8be
no problem, add check results
PeterSH6 Jan 8, 2025
a6829e6
add e2e training
PeterSH6 Jan 8, 2025
283aa2e
l20-0 seems has docker permission problem, test later
PeterSH6 Jan 8, 2025
9ebdfca
fix
PeterSH6 Jan 8, 2025
ed6d804
test l20-0 and torchrun
PeterSH6 Jan 8, 2025
c37aa78
test l20-0 and torchrun
PeterSH6 Jan 8, 2025
ba84d4c
fix
PeterSH6 Jan 8, 2025
e69c648
fix
PeterSH6 Jan 8, 2025
dc9563d
fix
PeterSH6 Jan 8, 2025
b5371b8
fix
PeterSH6 Jan 8, 2025
bccdb0e
fix
PeterSH6 Jan 8, 2025
6d9b85f
tolerate difference
PeterSH6 Jan 8, 2025
dfa3cb7
tolerate difference with levenshtein
PeterSH6 Jan 8, 2025
705d10a
lint
PeterSH6 Jan 8, 2025
dc78938
add more test for ray
PeterSH6 Jan 8, 2025
c04fca8
delete
PeterSH6 Jan 8, 2025
3c92fc6
use docker on l20
PeterSH6 Jan 8, 2025
091bec8
use docker on l20
PeterSH6 Jan 8, 2025
4d74389
add upgrade
PeterSH6 Jan 8, 2025
2c84712
update ci
PeterSH6 Jan 8, 2025
c069f6b
delete code
PeterSH6 Jan 8, 2025
414f0f2
ignore test
PeterSH6 Jan 8, 2025
b011882
upgrade ray
PeterSH6 Jan 8, 2025
433a678
fix workerhelper method
PeterSH6 Jan 8, 2025
101ae14
lint
PeterSH6 Jan 8, 2025
526fd31
revert worker changes
PeterSH6 Jan 8, 2025
611a630
fix
PeterSH6 Jan 8, 2025
3e56953
fix
PeterSH6 Jan 8, 2025
7f5e1ac
fix
PeterSH6 Jan 8, 2025
b6ed2d5
fix worker missing func
PeterSH6 Jan 8, 2025
10 changes: 5 additions & 5 deletions .github/workflows/gpu_test.yml → .github/workflows/dataset.yml
@@ -1,4 +1,4 @@
-name: ray
+name: dataset

on:
  # Trigger the workflow on push or pull request,
@@ -8,13 +8,13 @@ on:
      - main
    paths:
      - "**/*.py"
-      - .github/workflows/ray_test.yml
+      - .github/workflows/dataset.yml
  pull_request:
    branches:
      - main
    paths:
      - "**/*.py"
-      - .github/workflows/ray_test.yml
+      - .github/workflows/dataset.yml

jobs:
  ray:
@@ -30,7 +30,7 @@ jobs:
        run: |
          [ ! -d "$HOME/verl-data" ] && git clone --depth 1 https://github.com/eric-haibin-lin/verl-data ~/verl-data
          pytest -s -x tests/verl
-      - name: Running ray tests that need 2 GPUs
+      - name: Running ray test using cupy (move it to L20 when dockerfile ready)
        run: |
          cd tests/ray
-          pytest -s -x test_rvdz.py test_driverfunc_to_worker.py test_data_transfer.py test_colocated_workers.py test_check_worker_alive.py
+          pytest -s -x test_rvdz.py
38 changes: 38 additions & 0 deletions .github/workflows/e2e_gpu.yml
@@ -0,0 +1,38 @@
+name: e2e_gpu
+
+on:
+  # Trigger the workflow on push or pull request,
+  # but only for the main branch
+  push:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/e2e_gpu.yml
+  pull_request:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/e2e_gpu.yml
+
+jobs:
+  e2e_gpu:
+    runs-on: [self-hosted, l20-1]
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1"
+    container:
+      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      options: --gpus all --shm-size=10g
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip3 install -e .[test]
+      - name: Running digit completion e2e training tests on 8 L20 GPUs
+        run: |
+          bash tests/e2e/run_ray_trainer.sh
42 changes: 42 additions & 0 deletions .github/workflows/ray_test.yml
@@ -0,0 +1,42 @@
+name: ray
+
+on:
+  # Trigger the workflow on push or pull request,
+  # but only for the main branch
+  push:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/ray_test.yml
+  pull_request:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/ray_test.yml
+
+jobs:
+  ray:
+    runs-on: [self-hosted, l20-0]
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1"
+      HF_HUB_ENABLE_HF_TRANSFER: 1
+    container:
+      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      options: --gpus all --shm-size=10g
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip install hf_transfer
+          pip install -e .[test]
+          pip install --upgrade "ray>=2.40.0"
+      - name: Running ray tests that need 8 GPUs
+        run: |
+          cd tests/ray
+          pytest -s -x --ignore=test_check_worker_alive.py --ignore=test_rvdz.py .
39 changes: 39 additions & 0 deletions .github/workflows/sanity.yml
@@ -0,0 +1,39 @@
+name: sanity
+
+on:
+  # Trigger the workflow on push or pull request,
+  # but only for the main branch
+  push:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/sanity.yml
+  pull_request:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/sanity.yml
+
+jobs:
+  sanity:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10"]
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install the current repository
+        run: |
+          pip install -e .[test]
+      - name: Run sanity test
+        run: |
+          pytest -s -x tests/sanity
+      - name: Run utility test
+        run: |
+          pytest -s -x tests/utility
42 changes: 42 additions & 0 deletions .github/workflows/vllm.yml
@@ -0,0 +1,42 @@
+name: vllm
+
+on:
+  # Trigger the workflow on push or pull request,
+  # but only for the main branch
+  push:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/vllm.yml
+  pull_request:
+    branches:
+      - main
+    paths:
+      - "**/*.py"
+      - .github/workflows/vllm.yml
+
+jobs:
+  vllm:
+    runs-on: [self-hosted, l20-0]
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1"
+      HF_HUB_ENABLE_HF_TRANSFER: 1
+    container:
+      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      options: --gpus all --shm-size=10g
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip3 install hf_transfer
+          pip3 install -e .[test]
+          pip3 install vllm==0.5.4
+      - name: Running vllm tests on 8 L20 GPUs
+        run: |
+          cd tests/rollout
+          torchrun --standalone --nnodes=1 --nproc_per_node=8 $(which pytest) -s test_vllm_hf_loader.py
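
Note: the last step launches pytest itself under torchrun, so the test file runs once per GPU rank with RANK, WORLD_SIZE, and MASTER_ADDR already set by the launcher. A minimal sketch of a test written for that launch style (the test body below is a hypothetical illustration, not a test from this PR):

# Hypothetical sketch of a distributed test meant to run as:
#   torchrun --standalone --nnodes=1 --nproc_per_node=8 $(which pytest) -s test_example.py
# torchrun starts 8 processes; pytest then runs this module inside every rank.
import torch
import torch.distributed as dist

def test_allreduce_across_ranks():
    # Join the process group using the env vars torchrun provides.
    if not dist.is_initialized():
        dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    # Every rank contributes its rank id; the all-reduced sum must equal
    # 0 + 1 + ... + (world_size - 1) on all ranks.
    t = torch.tensor([float(rank)], device='cuda')
    dist.all_reduce(t)
    world_size = dist.get_world_size()
    assert t.item() == world_size * (world_size - 1) / 2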
2 changes: 1 addition & 1 deletion .github/workflows/yapf_format.yml
@@ -38,7 +38,7 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
-          pip install yapf
+          pip install --upgrade yapf
          pip install toml==0.10.2
      - name: Running yapf
        run: |
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,6 +4,7 @@ datasets
dill
hydra-core
numpy
+pandas
pybind11
ray
tensordict<0.6
Empty file added tests/e2e/__init__.py
46 changes: 46 additions & 0 deletions tests/e2e/arithmetic_sequence/data/create_dataset.py
@@ -0,0 +1,46 @@
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from tests.e2e.envs.digit_completion import DigitCompletion, generate_ground_truth_response
+from torch.utils import data
+import os
+
+if __name__ == '__main__':
+    simple_task = DigitCompletion(max_number=9, max_diff=9, max_num_in_response=9)
+    all_prompts = simple_task.get_all_prompts()
+
+    # 21 * 6 * 4
+    train_data, test_data = data.random_split(all_prompts, lengths=[0.8, 0.2])
+    train_data = list(train_data)
+    test_data = list(test_data)
+
+    train_data = [[{'role': 'user', 'content': str(item)}] \
+                  for item in train_data]
+    test_data = [[{'role': 'user', 'content': str(item)}] \
+                 for item in test_data]
+
+    print(f'Size of train: {len(train_data)}, size of test: {len(test_data)}')
+
+    train_data = {'prompt': train_data}
+    test_data = {'prompt': test_data}
+
+    model_folder = os.path.join(os.path.dirname(os.path.abspath(__file__)))
+
+    import pandas as pd
+
+    train_data_frame = pd.DataFrame(train_data)
+    test_data_frame = pd.DataFrame(test_data)
+
+    train_data_frame.to_parquet(os.path.join(model_folder, 'train.parquet'))
+    test_data_frame.to_parquet(os.path.join(model_folder, 'test.parquet'))
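
Note: a quick way to sanity-check the generated data is to read the parquet files back with pandas; a minimal sketch, assuming the script above has already been run in place:

# Load the generated parquet files back and inspect one prompt.
# Assumes this runs from the repo root after create_dataset.py has executed.
import pandas as pd

train = pd.read_parquet('tests/e2e/arithmetic_sequence/data/train.parquet')
print(len(train))          # number of training prompts
print(train['prompt'][0])  # a single-turn chat: [{'role': 'user', 'content': '...'}]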
Binary file added tests/e2e/arithmetic_sequence/data/test.parquet
Binary file added tests/e2e/arithmetic_sequence/data/train.parquet
29 changes: 29 additions & 0 deletions tests/e2e/arithmetic_sequence/model/config.json
@@ -0,0 +1,29 @@
+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": null,
+  "eos_token_id": 1,
+  "hidden_act": "silu",
+  "hidden_size": 128,
+  "initializer_range": 0.02,
+  "intermediate_size": 344,
+  "max_position_embeddings": 2048,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 4,
+  "num_hidden_layers": 4,
+  "num_key_value_heads": 4,
+  "pad_token_id": 2,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 10000.0,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.43.3",
+  "use_cache": true,
+  "vocab_size": 16
+}
61 changes: 61 additions & 0 deletions tests/e2e/arithmetic_sequence/model/create_model_tokenizer.py
@@ -0,0 +1,61 @@
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Create a random model and tokenizer for PPO training
+"""
+
+import torch
+import os
+from transformers import AutoModelForCausalLM, LlamaConfig, AutoTokenizer
+
+from tests.e2e.envs.digit_completion import CharTokenizer
+
+tokenizer = CharTokenizer(
+    characters=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ',', ':'],
+    model_max_length=2048,
+    chat_template=
+    "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set role = message['role'] %}{{ message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ sep_token }}{% endif %}"
+)
+
+config = LlamaConfig(vocab_size=(tokenizer.vocab_size + 16 - 1) // 16 * 16,
+                     hidden_size=128,
+                     intermediate_size=344,
+                     num_hidden_layers=4,
+                     num_attention_heads=4,
+                     num_key_value_heads=4,
+                     pad_token_id=tokenizer.pad_token_id,
+                     bos_token_id=tokenizer.bos_token_id,
+                     eos_token_id=tokenizer.eos_token_id)
+
+model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
+
+model_folder = os.path.join(os.path.dirname(os.path.abspath(__file__)))
+os.makedirs(model_folder, exist_ok=True)
+
+model.save_pretrained(model_folder)
+
+tokenizer_folder = model_folder
+tokenizer.save_pretrained(tokenizer_folder)
+
+load_tokenizer = AutoTokenizer.from_pretrained(tokenizer_folder)
+
+chat = [{'role': 'user', 'content': '1,0:2,3'}]
+
+load_tokenizer.padding_side = 'left'
+print(
+    load_tokenizer.apply_chat_template(chat,
+                                       tokenize=True,
+                                       add_generation_prompt=True,
+                                       max_length=10,
+                                       padding='max_length'))
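
Note: the vocab_size expression above rounds the tokenizer vocabulary up to the nearest multiple of 16, consistent with the vocab_size of 16 in the checked-in config.json. A worked check of that formula:

# Same round-up formula as in the LlamaConfig call above: pad the
# vocabulary size to the nearest multiple of 16.
def round_up_to_16(n: int) -> int:
    return (n + 16 - 1) // 16 * 16

assert round_up_to_16(12) == 16  # e.g. the 12-character base vocabulary
assert round_up_to_16(16) == 16
assert round_up_to_16(17) == 32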
6 changes: 6 additions & 0 deletions tests/e2e/arithmetic_sequence/model/generation_config.json
@@ -0,0 +1,6 @@
+{
+  "_from_model_config": true,
+  "eos_token_id": 1,
+  "pad_token_id": 2,
+  "transformers_version": "4.43.3"
+}
18 changes: 18 additions & 0 deletions tests/e2e/arithmetic_sequence/model/tokenizer_config.json
@@ -0,0 +1,18 @@
+{
+  "char_ords": [
+    48,
+    49,
+    50,
+    51,
+    52,
+    53,
+    54,
+    55,
+    56,
+    57,
+    44,
+    58
+  ],
+  "model_max_length": 2048,
+  "chat_template": "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set role = message['role'] %}{{ message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ sep_token }}{% endif %}"
+}
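
Note: char_ords stores ASCII code points; decoding them recovers the CharTokenizer's 12-character vocabulary:

# Decode the char_ords above back into characters.
chars = [chr(o) for o in [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 44, 58]]
print(chars)  # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ',', ':']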