
Commit 9236a4a

update with rename
1 parent 4543517 commit 9236a4a

12 files changed: +40 -41 lines changed

demo/intel_device_demo/itrex/itrex_cli_demo.py (1 addition, 1 deletion)

@@ -5,7 +5,7 @@
 import os


-MODEL_PATH = os.environ.get("MODEL_PATH", "THUDM/GLM-4-9B-Chat-0414")
+MODEL_PATH = os.environ.get("MODEL_PATH", "THUDM/GLM-4-9B-0414")


 from threading import Thread
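
For context, a minimal sketch (illustrative only, not part of this commit) of how the renamed default interacts with the `MODEL_PATH` environment variable the demo reads:

```python
import os

# Falls back to the renamed hub id when MODEL_PATH is not set in the environment;
# exporting MODEL_PATH (e.g. a local checkout of GLM-4-9B-0414) overrides the default.
MODEL_PATH = os.environ.get("MODEL_PATH", "THUDM/GLM-4-9B-0414")
print(MODEL_PATH)
```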

demo/intel_device_demo/openvino/convert.py (1 addition, 1 deletion)

@@ -16,7 +16,7 @@
 parser = argparse.ArgumentParser(add_help=False)
 parser.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
 parser.add_argument(
-    "-m", "--model_id", default="THUDM/GLM-4-9B-Chat-0414", required=False, type=str, help="orignal model path"
+    "-m", "--model_id", default="THUDM/GLM-4-9B-0414", required=False, type=str, help="orignal model path"
 )
 parser.add_argument(
     "-p",

finetune/README.md (10 additions, 10 deletions)

@@ -20,12 +20,12 @@ All fine-tuning tests were performed in the following environment:

 + Fine-tuning based on Llama-Factory

-| Fine-tuning Model         | Fine-tuning solution | GPU memory usage             |
-|---------------------------|----------------------|------------------------------|
-| GLM-4-9B-Chat-0414        | lora                 | 22G (Each GPU, Need 1 GPU)   |
-| GLM-4-9B-Chat-0414        | SFT (Zero3 method)   | 55G (Each GPU, Need 4 GPUs)  |
-| GLM-4-9B-Chat-0414        | lora                 | 80G (Each GPU, Need 8 GPUs)  |
-| GLM-4-32B-Chat-0414       | SFT (Zero3 method)   | 80G (Each GPU, Need 16 GPUs) |
+| Fine-tuning Model     | Fine-tuning solution | GPU memory usage             |
+|-----------------------|----------------------|------------------------------|
+| GLM-4-9B-0414         | lora                 | 22G (Each GPU, Need 1 GPU)   |
+| GLM-4-9B-0414         | SFT (Zero3 method)   | 55G (Each GPU, Need 4 GPUs)  |
+| GLM-4-9B-0414         | lora                 | 80G (Each GPU, Need 8 GPUs)  |
+| GLM-4-32B-0414        | SFT (Zero3 method)   | 80G (Each GPU, Need 16 GPUs) |

 + Fine-tuning based on this repository

@@ -38,7 +38,7 @@ All fine-tuning tests were performed in the following environment:

 ## Preparation

-Before starting fine-tuning, please install the dependencies in `basic_demo`, ensure you have cloned the latest version of the model repository, and install the dependencies in this directory:
+Before starting fine-tuning, please install the dependencies in `inference`, ensure you have cloned the latest version of the model repository, and install the dependencies in this directory:

 ```bash
 pip install -r requirements.txt

@@ -261,14 +261,14 @@ Execute **single machine multi-card/multi-machine multi-card** run through the f
 the acceleration solution, and you need to install `deepspeed`.

 ```shell
-OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9b-Chat-0414 configs/lora.yaml # For Chat Fine-tune
+OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9b-0414 configs/lora.yaml # For Chat Fine-tune
 OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
 ```

 Execute **single machine single card** run through the following code.

 ```shell
-python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-Chat-0414 configs/lora.yaml # For Chat Fine-tune
+python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
 python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
 ```

@@ -284,7 +284,7 @@ half-trained model, you can add a fourth parameter, which can be passed in two w
 For example, this is an example code to continue fine-tuning from the last saved point

 ```shell
-python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-Chat-0414 configs/lora.yaml yes
+python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
 ```

 ## Use the fine-tuned model

finetune/README_zh.md (9 additions, 9 deletions)

@@ -21,12 +21,12 @@ Read this in [English](README)

 + 基于 Llama-Factory 进行微调

-| Fine-tuning Model         | Fine-tuning solution | GPU memory usage             |
-|---------------------------|----------------------|------------------------------|
-| GLM-4-9B-Chat-0414        | lora                 | 22G (Each GPU, Need 1 GPU)   |
-| GLM-4-9B-Chat-0414        | SFT (Zero3 method)   | 55G (Each GPU, Need 4 GPUs)  |
-| GLM-4-9B-Chat-0414        | lora                 | 80G (Each GPU, Need 8 GPUs)  |
-| GLM-4-32B-Chat-0414       | SFT (Zero3 method)   | 80G (Each GPU, Need 16 GPUs) |
+| Fine-tuning Model     | Fine-tuning solution | GPU memory usage             |
+|-----------------------|----------------------|------------------------------|
+| GLM-4-9B-0414         | lora                 | 22G (Each GPU, Need 1 GPU)   |
+| GLM-4-9B-0414         | SFT (Zero3 method)   | 55G (Each GPU, Need 4 GPUs)  |
+| GLM-4-9B-0414         | lora                 | 80G (Each GPU, Need 8 GPUs)  |
+| GLM-4-32B-0414        | SFT (Zero3 method)   | 80G (Each GPU, Need 16 GPUs) |

 + 基于本仓库代码微调

@@ -261,14 +261,14 @@ pip install -r requirements.txt
 通过以下代码执行 **单机多卡/多机多卡** 运行,这是使用 `deepspeed` 作为加速方案的,您需要安装 `deepspeed`。接着,按照此命令运行:

 ```shell
-OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-Chat-0414 configs/lora.yaml # For Chat Fine-tune
+OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
 OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
 ```

 通过以下代码执行 **单机单卡** 运行。

 ```shell
-python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-Chat-0414 configs/lora.yaml # For Chat Fine-tune
+python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
 python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
 ```

@@ -282,7 +282,7 @@ python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml
 例如,这就是一个从最后一个保存点继续微调的示例代码

 ```shell
-python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-Chat-0414 configs/lora.yaml yes
+python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
 ```

 ## 使用微调后的模型

inference/README.md (7 additions, 7 deletions)

@@ -28,7 +28,7 @@ Test Hardware:

 The following stress test results show memory usage and latency during inference. If multiple GPUs are used, "Memory Usage" refers to the maximum usage on a single GPU.

-#### GLM-4-32B-Chat-0414
+#### GLM-4-32B-0414

 | Precision | #GPUs | Memory Usage | First Token Latency | Token Output Speed | Input Tokens |
 |-------------|-------|---------------|---------------------|-------------------|--------------|
@@ -37,7 +37,7 @@ The following stress test results show memory usage and latency during inference
 | BF16 | 2 | 50 GB | 6.75s | 8.1 tokens/s | 32000 |
 | BF16 | 4 | 55 GB | 37.83s | 3.0 tokens/s | 100000 |

-#### GLM-4-9B-Chat-0414
+#### GLM-4-9B-0414

 | Precision | #GPUs | Memory Usage | First Token Latency | Token Output Speed | Input Tokens |
 |-----------|-------|---------------|----------------------|---------------------|---------------|
@@ -71,35 +71,35 @@ The following stress test results show memory usage and latency during inference
 + Use the command line to communicate with the GLM-4-9B model.

 ```shell
-python trans_cli_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python trans_cli_demo.py # LLM Such as GLM-4-9B-0414
 python trans_cli_vision_demo.py # GLM-4V-9B
 ```

 + Use the Gradio web client to communicate with the GLM-4-9B model.

 ```shell
-python trans_web_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python trans_web_demo.py # LLM Such as GLM-4-9B-0414
 python trans_web_vision_demo.py # GLM-4V-9B
 ```

 + Use Batch inference.

 ```shell
-python trans_batch_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python trans_batch_demo.py # LLM Such as GLM-4-9B-0414
 ```

 ### Use vLLM backend code

 + Use the command line to communicate with the GLM-4-9B-Chat model.

 ```shell
-python vllm_cli_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python vllm_cli_demo.py # LLM Such as GLM-4-9B-0414
 ```

 + Launch an OpenAI-compatible API service.

 ```shell
-vllm serve THUDM/GLM-4-9B-Chat-0414 --tensor_parallel_size 2
+vllm serve THUDM/GLM-4-9B-0414 --tensor_parallel_size 2
 ```

 ### Use glm-4v to build an OpenAI-compatible service

inference/README_zh.md (6 additions, 6 deletions)

@@ -28,7 +28,7 @@ pip install -r requirements.txt

 推理的压力测试数据如下,如有多张显卡,则显存占用代表显存占用最大一张显卡的显存消耗。

-#### GLM-4-32B-Chat-0414
+#### GLM-4-32B-0414

 | 精度 | 显卡数量 | 显存占用 | 首 Token 延迟 | Token 输出速度 | 输入token数 |
 |------|------|-------|------------|---------------|----------|
@@ -37,7 +37,7 @@ pip install -r requirements.txt
 | BF16 | 2 | 50 GB | 6.75s | 8.1 tokens/s | 32000 |
 | BF16 | 4 | 55 GB | 37.83s | 3.0 tokens/s | 100000 |

-#### GLM-4-9B-Chat-0414
+#### GLM-4-9B-0414

 | 精度 | 显卡数量 | 显存占用 | 首 Token 延迟 | Token 输出速度 | 输入token数 |
 |------|------|-------|------------|---------------|---------|
@@ -72,14 +72,14 @@ pip install -r requirements.txt
 + 使用命令行与 GLM-4-9B 模型进行对话。

 ```shell
-python trans_cli_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python trans_cli_demo.py # LLM Such as GLM-4-9B-0414
 python trans_cli_vision_demo.py # GLM-4V-9B
 ```

 + 使用 Gradio 网页端与 GLM-4-9B 模型进行对话。

 ```shell
-python trans_web_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python trans_web_demo.py # LLM Such as GLM-4-9B-0414
 python trans_web_vision_demo.py # GLM-4V-9B
 ```

@@ -94,12 +94,12 @@ python trans_batch_demo.py
 + 使用命令行与 GLM-4-9B-Chat 模型进行对话。

 ```shell
-python vllm_cli_demo.py # LLM Such as GLM-4-9B-Chat-0414
+python vllm_cli_demo.py # LLM Such as GLM-4-9B-0414
 ```

 + 构建 OpenAI 类 API 服务。
 ```shell
-vllm serve THUDM/GLM-4-9B-Chat-0414 --tensor_parallel_size 2
+vllm serve THUDM/GLM-4-9B-0414 --tensor_parallel_size 2
 ```

 ### 使用 glm-4v 构建 OpenAI 服务

inference/glm4v_api_request.py (1 addition, 1 deletion)

@@ -2,7 +2,7 @@
 This script creates a OpenAI Request demo for the glm-4v-9b model, just Use OpenAI API to interact with the model.
 For LLM such as GLM-4-9B-0414, using with vLLM OpenAI Server.

-vllm serve THUDM/GLM-4-32B-Chat-0414 --tensor_parallel_size 4
+vllm serve THUDM/GLM-4-32B-0414 --tensor_parallel_size 4

 """
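
The docstring above describes starting a vLLM OpenAI-compatible server and then talking to it through the OpenAI API. A minimal client sketch follows (illustrative only, not part of this commit); it assumes the server was launched with the `vllm serve` command shown above and is listening on vLLM's default local port 8000:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; vLLM does not check the API key,
# so any placeholder value works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/GLM-4-32B-0414",  # must match the id passed to `vllm serve`
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```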

inference/trans_batch_demo.py (1 addition, 1 deletion)

@@ -11,7 +11,7 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList


-MODEL_PATH = "THUDM/GLM-4-9B-Chat-0414"
+MODEL_PATH = "THUDM/GLM-4-9B-0414"

 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
 model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto").eval()

inference/trans_cli_demo.py (1 addition, 1 deletion)

@@ -25,7 +25,7 @@
 )


-MODEL_PATH = "THUDM/GLM-4-9B-Chat-0414"
+MODEL_PATH = "THUDM/GLM-4-9B-0414"

 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

inference/trans_stress_test.py (1 addition, 1 deletion)

@@ -6,7 +6,7 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


-MODEL_PATH = "THUDM/GLM-4-9B-Chat-0414"
+MODEL_PATH = "THUDM/GLM-4-9B-0414"


 def stress_test(input_token_len, n, output_token_len):
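
The three transformers demos above all load the checkpoint the same way. A minimal end-to-end sketch of that pattern with the renamed id (illustrative only, not part of this commit; the prompt and generation settings are arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/GLM-4-9B-0414"  # renamed hub id introduced by this commit

# Same loading pattern used by trans_batch_demo.py, trans_cli_demo.py and trans_stress_test.py.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto").eval()

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```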
