feat: Added training scripts Independent of Hugging Face's Trainer #27

Open · wants to merge 50 commits into base: main
Commits (50)
c5b6618
Add [train_ds]: initial commit for train_ds.py
chizuchizu Oct 19, 2023
9974578
WIP update [train_ds]: cp from VisualChat
chizuchizu Oct 19, 2023
9f32e64
rm comment and argparser, set fire
chizuchizu Oct 19, 2023
5c82169
update [train_ds.py, config]
chizuchizu Oct 19, 2023
62c4ed5
fix [train_ds.py]: set to half the input batch and eval model input
chizuchizu Oct 19, 2023
62cb28d
add [utils.py]: for train_dspy
chizuchizu Oct 23, 2023
2ac2834
update [train_ds]: fix input, rm config element
chizuchizu Oct 23, 2023
ffdf56a
fix [train_ds.py]: saving model
chizuchizu Oct 25, 2023
dae2e3b
add wandb [train_ds.py]
chizuchizu Oct 25, 2023
8043bbf
fix avg loss metric
chizuchizu Oct 25, 2023
0050f85
update config for train_ds.py[exp_002.yml]
chizuchizu Oct 25, 2023
555a1c7
update [train_ds.py]
chizuchizu Oct 28, 2023
103737c
update train_ds.py
chizuchizu Oct 30, 2023
97edccb
fix print -> print_rank_0[train_ds.py]
chizuchizu Oct 31, 2023
ceda701
add simple coco (captioning) dataset
chizuchizu Oct 31, 2023
102fbca
add progressbar, fix model save point(rm merging LoRA while training)
chizuchizu Oct 31, 2023
3ac13c5
add beta to config
chizuchizu Oct 31, 2023
1c0a7ea
restore utils.py
chizuchizu Oct 31, 2023
51695a2
rm comment
chizuchizu Oct 31, 2023
0cb821e
Jp -> En [comment]
chizuchizu Nov 5, 2023
2aade14
Log lr to wandb
chizuchizu Nov 5, 2023
02289e1
support full parameter tuning
chizuchizu Nov 5, 2023
9f02ae7
add license
chizuchizu Nov 5, 2023
281d1e0
change license
chizuchizu Nov 5, 2023
debe6be
add DeepSpeedExamples to acknowledge
chizuchizu Nov 5, 2023
4464c10
change: coco.yaml -> m3it_coco.yaml
chizuchizu Nov 6, 2023
b675e82
add notice, copyright
chizuchizu Nov 6, 2023
45ef591
add todo (merge LoRA)
chizuchizu Nov 6, 2023
d9d4635
add uses [README]
chizuchizu Nov 6, 2023
f1acdba
fix path
chizuchizu Nov 6, 2023
73a02b2
fix calc loss stepwise
chizuchizu Nov 7, 2023
45563b3
fix [exp002_ds.yml]
chizuchizu Nov 7, 2023
50110a8
chore typo
chizuchizu Nov 7, 2023
d06d5e4
rm redundent saving model
chizuchizu Nov 7, 2023
1c31aee
adapt the saving model structure to original
chizuchizu Nov 7, 2023
d261c8c
Fix: all reduce logic
chizuchizu Nov 13, 2023
1354b33
add notice (applied format)
chizuchizu Nov 13, 2023
77a7fe0
rm lora
chizuchizu Nov 17, 2023
b4b59ea
add ZeRO-3 instruction
chizuchizu Nov 20, 2023
9342e04
add JP docs
chizuchizu Nov 20, 2023
3ec6037
add chinese docs
chizuchizu Nov 20, 2023
92a422e
fix diff
chizuchizu Nov 20, 2023
d7b0f68
Revert "fix diff"
chizuchizu Nov 20, 2023
ad4a19b
comment lora config
chizuchizu Nov 20, 2023
b693714
update README
chizuchizu Nov 20, 2023
d92c798
save model trained by ZeRO-3
chizuchizu Nov 24, 2023
74d9696
rm conversion [README]
chizuchizu Nov 24, 2023
f8aecb3
add HfDeepSpeedConfig on ZeRO-3 training
chizuchizu Nov 29, 2023
c741879
fix zero stage
chizuchizu Dec 1, 2023
2ebd456
add initialization for mpirun
chizuchizu Dec 1, 2023
87 changes: 87 additions & 0 deletions README.md
@@ -143,8 +143,74 @@ To start learning, execute the following command.

A GPU is required for training; we have tested on Ubuntu 20.04 with CUDA 11.7.

# Training (w/o Trainer)

We provide `train_ds.py`, a training script that does not depend on Hugging Face's `Trainer` class, for more flexible training configurations.
For example, [projects/opt/exp002_ds.yml](projects/opt/exp002_ds.yml) contains the following:

```yaml
training_config:
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 4
num_train_epochs: 5
dataloader_num_workers: 16
learning_rate: 5.0e-5
output_dir: ./output/
report_to: "wandb"
zero_stage: 2
precision: "fp16"
enable_tensorboard: False
seed: 0
weight_decay: 0.
learning_rate_pretraining_components: 0.
num_warmup_steps: 0.
optim_betas:
- 0.9
- 0.95
lr_scheduler_type: "cosine"
gradient_checkpointing: False
cpu_offload: False


model_config:
pretrained_path: # None or path to model weight
model_type: git_llm
language_model_name: facebook/opt-125m
vision_model_name: openai/clip-vit-base-patch16
num_image_with_embedding: 1 # if 1, no img_temporal_embedding
max_length: 512
keys_to_finetune:
- visual_projection
- num_image_with_embedding
keys_to_freeze: []

# TODO: support LoRA
# use_lora: false
# lora:
# r: 8
# lora_alpha: 32
# target_modules:
# - q_proj
# - k_proj
# - v_proj
# lora_dropout: 0.01
# bias: none
# task_type: CAUSAL_LM

dataset_config_path:
- ./configs/datasets/m3it_coco.yaml # only coco dataset
```

To start training, run the following command.

```bash
./scripts/run_ds.sh
```
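
For orientation, the overall flow of a `Trainer`-free DeepSpeed script such as `train_ds.py` is roughly the sketch below. This is an illustrative sketch only: the variable names, the subset of DeepSpeed config keys, and the bare-bones loop are assumptions, not the script's actual code.

```python
# Illustrative sketch of a Trainer-free DeepSpeed training flow (not the real train_ds.py).
import deepspeed
import yaml

from heron.models.utils import load_model

# 1. Read the experiment YAML shown above.
with open("./projects/opt/exp002_ds.yml") as f:
    config = yaml.safe_load(f)
train_cfg = config["training_config"]

# 2. Build the GIT-LLM model from model_config.
model = load_model(config["model_config"])

# 3. Map a few training_config fields onto a DeepSpeed config dict.
ds_config = {
    "train_micro_batch_size_per_gpu": train_cfg["per_device_train_batch_size"],
    "gradient_accumulation_steps": train_cfg["gradient_accumulation_steps"],
    "zero_optimization": {"stage": train_cfg["zero_stage"]},
    "fp16": {"enabled": train_cfg["precision"] == "fp16"},
}

# 4. Wrap the model in a DeepSpeed engine and run an explicit training loop.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
train_loader = []  # placeholder for the DataLoader built from dataset_config_path
for batch in train_loader:
    loss = engine(**batch).loss  # assumes the model returns an output object with .loss
    engine.backward(loss)
    engine.step()
```

Because the loop is explicit, details such as gradient accumulation, wandb logging, and checkpointing can be adjusted directly rather than through `Trainer` callbacks.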

# Evaluation

If you have a model trained with ZeRO-3, see the loading changes described further below.
You can get the pretrained weights from Hugging Face Hub: [turing-motors/heron-chat-git-ja-stablelm-base-7b-v0](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v0)<br>
See also [notebooks](./notebooks).

@@ -197,6 +263,26 @@ with torch.no_grad():
print(processor.tokenizer.batch_decode(out)[0])
```

If your model was trained with ZeRO-3, the inference code must be modified as follows:

```diff
- # prepare a pretrained model
- model = GitLlamaForCausalLM.from_pretrained(
- 'turing-motors/heron-chat-git-Llama-2-7b-v0', torch_dtype=torch.float16
- )
+ from heron.models.utils import load_model, load_pretrained_weight
+ import yaml
+
+ config_file = "./projects/opt/exp002_ds.yml"
+
+ # get config
+ with open(config_file, "r") as i_:
+ config = yaml.safe_load(i_)
+
+ model = load_model(config["model_config"])
+ model.load_state_dict(torch.load('./output/opt/exp002_ds/epoch-1/pytorch_model.bin'), strict=True)
```
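
After the state dict has been loaded, the restored model still needs to be placed on the GPU before running the generation example above. A minimal follow-up, assuming a CUDA device and fp16 inference, is:

```python
# Move the restored model to the GPU in half precision and switch to eval mode.
model = model.half().to("cuda")
model.eval()
```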

### Pretrained Models

|model|LLM module|adapter|size|
@@ -225,3 +311,4 @@ Released under the [Apache License 2.0](./LICENSE).
- [GenerativeImage2Text](https://github.com/microsoft/GenerativeImage2Text): The main idea of the model is based on the original GIT.
- [Llava](https://github.com/haotian-liu/LLaVA): This project learned a lot from the great LLaVA project.
- [GIT-LLM](https://github.com/Ino-Ichan/GIT-LLM)
- [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples)
3 changes: 3 additions & 0 deletions configs/datasets/m3it_coco.yaml
@@ -0,0 +1,3 @@
dataset_type: m3it
dataset_names:
- coco
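
For context, the `m3it` dataset type presumably resolves to the M3IT collection on the Hugging Face Hub, with each entry under `dataset_names` selecting one subset. A hypothetical sketch of what this three-line config ends up loading (heron's actual dataset-building code may differ, and newer `datasets` releases may additionally require `trust_remote_code=True`):

```python
# Hypothetical illustration of what m3it_coco.yaml selects; heron's actual
# dataset-building code may differ.
from datasets import load_dataset

coco_train = load_dataset("MMInstruction/M3IT", "coco", split="train")
print(len(coco_train))  # number of COCO captioning examples in M3IT
```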
85 changes: 85 additions & 0 deletions docs/README_CN.md
@@ -143,6 +143,70 @@ "training_config" is the training settings, "model_config" is the model settings, "dataset_conf

A GPU is required for training; we have tested the system on Ubuntu 20.04 with CUDA 11.7.

# Training (w/o Trainer)
We provide `train_ds.py`, a training script that does not depend on Hugging Face's `Trainer` class, for more flexible training configurations. For example, the content of [projects/opt/exp002_ds.yml](../projects/opt/exp002_ds.yml) is as follows:

```yaml
training_config:
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 4
num_train_epochs: 5
dataloader_num_workers: 16
learning_rate: 5.0e-5
output_dir: ./output/
report_to: "wandb"
zero_stage: 2
precision: "fp16"
enable_tensorboard: False
seed: 0
weight_decay: 0.
learning_rate_pretraining_components: 0.
num_warmup_steps: 0.
optim_betas:
- 0.9
- 0.95
lr_scheduler_type: "cosine"
gradient_checkpointing: False
cpu_offload: False


model_config:
pretrained_path: # None or path to model weight
model_type: git_llm
language_model_name: facebook/opt-125m
vision_model_name: openai/clip-vit-base-patch16
num_image_with_embedding: 1 # if 1, no img_temporal_embedding
max_length: 512
keys_to_finetune:
- visual_projection
- num_image_with_embedding
keys_to_freeze: []

# TODO: support LoRA
# use_lora: false
# lora:
# r: 8
# lora_alpha: 32
# target_modules:
# - q_proj
# - k_proj
# - v_proj
# lora_dropout: 0.01
# bias: none
# task_type: CAUSAL_LM

dataset_config_path:
- ./configs/datasets/m3it_coco.yaml # only coco dataset
```

To start training, run the following command.


```bash
./scripts/run_ds.sh
```

# Usage

You can download the trained model from Hugging Face Hub: [turing-motors/heron-chat-git-ja-stablelm-base-7b-v0](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v0)<br>
@@ -195,6 +259,26 @@ with torch.no_grad():
print(processor.tokenizer.batch_decode(out))
```

If the model was trained with ZeRO-3, make the following changes.

```diff
- # prepare a pretrained model
- model = GitLlamaForCausalLM.from_pretrained(
- 'turing-motors/heron-chat-git-Llama-2-7b-v0', torch_dtype=torch.float16
- )
+ from heron.models.utils import load_model, load_pretrained_weight
+ import yaml
+
+ config_file = "./projects/opt/exp002_ds.yml"
+
+ # get config
+ with open(config_file, "r") as i_:
+ config = yaml.safe_load(i_)
+
+ model = load_model(config["model_config"])
+ model.load_state_dict(torch.load('./output/opt/exp002_ds/epoch-1/pytorch_model.bin'), strict=True)
```

### Pretrained Models

|model|LLM module|adapter|size|
@@ -222,3 +306,4 @@ print(processor.tokenizer.batch_decode(out))
- [GenerativeImage2Text](https://github.com/microsoft/GenerativeImage2Text)
- [Llava](https://github.com/haotian-liu/LLaVA)
- [GIT-LLM](https://github.com/Ino-Ichan/GIT-LLM)
- [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples)
85 changes: 85 additions & 0 deletions docs/README_JP.md
@@ -142,6 +142,70 @@ dataset_config_path:

A GPU is required for training; operation has been verified on Ubuntu 20.04 with CUDA 11.7.

# Training (w/o Trainer)
We provide `train_ds.py`, a training script that does not depend on Hugging Face's `Trainer` class.<br>
For example, the content of [projects/opt/exp002_ds.yml](../projects/opt/exp002_ds.yml) is as follows:

```yaml
training_config:
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 4
num_train_epochs: 5
dataloader_num_workers: 16
learning_rate: 5.0e-5
output_dir: ./output/
report_to: "wandb"
zero_stage: 2
precision: "fp16"
enable_tensorboard: False
seed: 0
weight_decay: 0.
learning_rate_pretraining_components: 0.
num_warmup_steps: 0.
optim_betas:
- 0.9
- 0.95
lr_scheduler_type: "cosine"
gradient_checkpointing: False
cpu_offload: False


model_config:
pretrained_path: # None or path to model weight
model_type: git_llm
language_model_name: facebook/opt-125m
vision_model_name: openai/clip-vit-base-patch16
num_image_with_embedding: 1 # if 1, no img_temporal_embedding
max_length: 512
keys_to_finetune:
- visual_projection
- num_image_with_embedding
keys_to_freeze: []

# TODO: support LoRA
# use_lora: false
# lora:
# r: 8
# lora_alpha: 32
# target_modules:
# - q_proj
# - k_proj
# - v_proj
# lora_dropout: 0.01
# bias: none
# task_type: CAUSAL_LM

dataset_config_path:
- ./configs/datasets/m3it_coco.yaml # only coco dataset
```

To start training, run the following command.

```bash
./scripts/run_ds.sh
```

# Usage

You can download the trained model from Hugging Face Hub: [turing-motors/heron-chat-git-ja-stablelm-base-7b-v0](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v0)<br>
@@ -194,6 +258,26 @@ with torch.no_grad():
print(processor.tokenizer.batch_decode(out))
```

If the model was trained with ZeRO-3, make the following changes to the inference code.

```diff
- # prepare a pretrained model
- model = GitLlamaForCausalLM.from_pretrained(
- 'turing-motors/heron-chat-git-Llama-2-7b-v0', torch_dtype=torch.float16
- )
+ from heron.models.utils import load_model, load_pretrained_weight
+ import yaml
+
+ config_file = "./projects/opt/exp002_ds.yml"
+
+ # get config
+ with open(config_file, "r") as i_:
+ config = yaml.safe_load(i_)
+
+ model = load_model(config["model_config"])
+ model.load_state_dict(torch.load('./output/opt/exp002_ds/epoch-1/pytorch_model.bin'), strict=True)
```

### Pretrained Models

|model|LLM module|adapter|size|
@@ -221,3 +305,4 @@ print(processor.tokenizer.batch_decode(out))
- [GenerativeImage2Text](https://github.com/microsoft/GenerativeImage2Text): The model architecture is inspired by the original GIT.
- [Llava](https://github.com/haotian-liu/LLaVA): This library draws heavily on the LLaVA project.
- [GIT-LLM](https://github.com/Ino-Ichan/GIT-LLM)
- [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples)
5 changes: 4 additions & 1 deletion heron/models/utils.py
@@ -191,6 +191,9 @@ def set_trainable_params(
untrainable_list.append(name)

else:
raise ValueError("either keys_to_freeze or keys_to_finetune should be specified")
# Full parameter Tuning
for name, p in model.named_parameters():
p.requires_grad = True
trainable_list.append(name)

return trainable_list, untrainable_list
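
In effect, the `else` branch that previously raised a `ValueError` now falls back to full-parameter tuning when neither `keys_to_finetune` nor `keys_to_freeze` selects anything. A minimal sketch of the new behaviour follows; the call signature of `set_trainable_params` is assumed from the diff context above and may differ from the real one.

```python
# Hypothetical illustration of the full-parameter-tuning fallback added above;
# the real signature of set_trainable_params may differ.
import yaml

from heron.models.utils import load_model, set_trainable_params

with open("./projects/opt/exp002_ds.yml") as f:
    config = yaml.safe_load(f)

model = load_model(config["model_config"])

# With both key lists empty, every parameter is marked requires_grad=True
# (full-parameter tuning) instead of raising a ValueError.
trainable, untrainable = set_trainable_params(model, [], [])
assert untrainable == []
```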