update config guide
lvhan028 committed Nov 13, 2023
1 parent c481654 commit fa668e5
Showing 2 changed files with 125 additions and 42 deletions.
129 changes: 107 additions & 22 deletions docs/en/turbomind_config.md
@@ -1,8 +1,12 @@
# TurboMind Config

TurboMind is one of LMDeploy's inference engines. Before running inference with it, you need to convert the input model into the TurboMind format. Besides the model weights, a TurboMind model folder contains several other files, the most important of which is the configuration file `triton_models/weights/config.ini`, since it is closely tied to inference performance.

If you are using LMDeploy 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section for the relevant configuration details. If you are using LMDeploy 0.1.x, please read [turbomind 2.0 config](#turbomind-20-config) instead.

## TurboMind 1.0 config

Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:

```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```

These parameters consist of model attributes and inference parameters. The model attributes, such as the number of layers, the number of heads and the head dimension, are **not modifiable**:

```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```

The following sections focus on the inference parameters.

### data type

`weight_type` and `group_size` are the relevant parameters, which cannot be modified.

`weight_type` represents the data type of the weights. Currently, `fp16` and `int4` are supported; `int4` means 4-bit weights. When `weight_type` is `int4`, `group_size` is the group size used by `awq` when quantizing the weights. At present, TurboMind only supports `group_size = 128`.
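
For reference, a 4-bit AWQ-quantized model would carry entries like the following in its `config.ini` (an illustrative snippet; all other fields stay as shown above):

```toml
weight_type = int4
group_size = 128
```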

### batch size

`max_batch_size` determines the maximum batch size during inference. In general, the larger the batch size, the higher the throughput. But make sure that `max_batch_size <= cache_max_entry_count`.
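
For example, the 1.0 defaults shown above already satisfy this constraint (32 ≤ 48); if you enlarge the batch, keep the cache entry count at least as large (illustrative values):

```toml
max_batch_size = 48
cache_max_entry_count = 48
```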

### k/v cache size

TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.

- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the size of k/v sequences to be allocated when new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached (a rough sizing sketch follows this list).
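
As a back-of-the-envelope illustration (an estimate made in this guide, not the engine's exact allocation logic), the k/v footprint of one fully grown sequence follows from the model attributes above:

```python
def kv_bytes_per_seq(session_len, num_layer, kv_head_num, size_per_head, dtype_bytes=2):
    # k and v each hold session_len * num_layer * kv_head_num * size_per_head elements
    return session_len * num_layer * kv_head_num * size_per_head * 2 * dtype_bytes

# llama-2-7b-chat defaults: session_len=4104, 32 layers, 32 kv heads, head dim 128, fp16 k/v
per_seq = kv_bytes_per_seq(4104, 32, 32, 128)
print(f"{per_seq / 2**30:.2f} GiB per cached sequence")  # ~2.00 GiB
# cache_max_entry_count = 48 caps how many such sequences may be cached;
# cache_chunk_size = 1 means memory is claimed one sequence at a time.
```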

### kv int8 switch

To enable 8-bit k/v inference, you need to modify the parameters `quant_policy` and `use_context_fmha`. Please refer to the [kv int8](./kv_int8.md) guide first.
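
Concretely, after exporting the quantization parameters as described in that guide, the two entries become:

```toml
quant_policy = 4
use_context_fmha = 0
```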

### long context switch

By setting `use_dynamic_ntk = 1`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.

Regarding the principle of Dynamic NTK, please refer to:

1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675

You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
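
Both switches are plain flags in `config.ini`; enabling the two long-context features therefore looks like:

```toml
use_dynamic_ntk = 1
use_logn_attn = 1
```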

## TurboMind 2.0 config

In TurboMind 2.0, the model attribute part of the config remains the same as in TurboMind 1.0, while the inference parameters have changed. Again taking the `llama-2-7b-chat` model as an example, its `config.ini` in TurboMind 2.0 is as follows:

```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```

### data type

The same as in TurboMind 1.0

### batch size

The maximum batch size is still set through `max_batch_size`, but its default value has been changed from 32 to 64, and `max_batch_size` is no longer coupled to `cache_max_entry_count`.

### k/v cache size

The k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.

TurboMind 2.0 implements Paged Attention and manages the k/v cache in blocks.

`cache_block_seq_len` is the number of tokens held by one k/v block; it defaults to 128. TurboMind computes the memory size of a k/v block with the following formula:

```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```

For the llama2-7b model, when k/v is stored as the `half` type, one k/v block occupies `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`.
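
The same arithmetic as a short script (a sketch; `dtype_bytes = 2` assumes k/v is stored as `half`):

```python
def kv_block_bytes(cache_block_seq_len, num_layer, kv_head_num, size_per_head, dtype_bytes=2):
    # one block stores k and v for cache_block_seq_len tokens across all layers
    return cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * dtype_bytes

print(kv_block_bytes(128, 32, 32, 128) // 2**20, "MiB")  # 64 MiB for llama2-7b
```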

The meaning of `cache_max_entry_count` depends on its value:

- When it is a decimal in (0, 1), `cache_max_entry_count` is the fraction of GPU memory given to k/v blocks (see the sketch after this list). For example, when TurboMind launches on an A100-80G GPU with `cache_max_entry_count` set to `0.5`, the k/v blocks use `80 * 0.5 = 40G` of memory.
- When it is an integer no less than 1, it is the number of k/v blocks.
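
Combining the ratio with the block size gives a rough block count (a sketch that, for illustration, measures the ratio against total GPU memory as in the example above):

```python
def num_kv_blocks(total_gpu_bytes, cache_max_entry_count, block_bytes):
    # fraction of GPU memory handed to k/v blocks, divided by one block's size
    return int(total_gpu_bytes * cache_max_entry_count // block_bytes)

block_bytes = 64 * 2**20  # 64 MiB per block for llama2-7b, from the formula above
print(num_kv_blocks(80 * 2**30, 0.5, block_bytes))  # -> 640 blocks on an 80 GB card
```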

### kv int8 switch

The same as in TurboMind 1.0

### long context switch

The same as in TurboMind 1.0
38 changes: 18 additions & 20 deletions docs/zh_cn/turbomind_config.md
@@ -1,40 +1,40 @@
# TurboMind Config

TurboMind is LMDeploy's inference engine. To run LLM inference with it, you need to convert the input model into the TurboMind format. Besides the model weights, the TurboMind model folder contains several other files, the most important of which is the configuration file `triton_models/weights/config.ini`, which is closely tied to inference performance.

If you are using LMDeploy 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section for the relevant configuration details. If you are using LMDeploy 0.1.x, please read [turbomind 2.0 config](#turbomind-20-config) instead.

## TurboMind 1.0 config

Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0 its `config.ini` content is as follows:

```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```

These parameters consist of model attributes and inference parameters. The model attributes, such as the number of layers, the number of heads and the head dimension, are **not modifiable**:

```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```

The following sections focus on the inference parameters.
@@ -73,13 +73,13 @@

TurboMind allocates k/v cache memory according to `session_len`, `cache_chunk_size` and `cache_max_entry_count`.

- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates how many sequences' worth of k/v cache is allocated each time new sequences are added.
- `cache_max_entry_count` specifies the maximum number of sequences that can be cached.

### kv int8 switch

To enable 8-bit k/v inference, you need to modify the parameters `quant_policy` and `use_context_fmha`. Please refer to the [kv int8](./kv_int8.md) deployment guide for details.

### long context switch

@@ -92,9 +92,9 @@

Setting `use_logn_attn = 1` turns on [LogN attention scaling](https://kexue.fm/archives/8823).

## TurboMind 2.0 config

In the TurboMind 2.0 config, the model attribute part is identical to 1.0, while the inference parameters have changed. Below, we again take the `llama-2-7b-chat` model's config as an example and focus on the changed inference parameters. In TurboMind 2.0, the `config.ini` of `llama-2-7b-chat` is as follows:

```toml
[llama]
Expand Down Expand Up @@ -128,8 +128,6 @@ use_dynamic_ntk = 0
use_logn_attn = 0
```

### data type

The same as in TurboMind 1.0
