update guide
lvhan028 committed Nov 17, 2023
1 parent 05ac201 commit 2c1a466
Showing 5 changed files with 139 additions and 88 deletions.
7 changes: 1 addition & 6 deletions README.md
@@ -20,12 +20,7 @@ ______________________________________________________________________

## News 🎉

- \[2023/11\] Turbomind has been upgraded to version 2.0, including the following features:
- Paged Attention
- Faster attention kernels with no limitation on max sequence length
- Faster KV8 kernels (like 2x faster)
- Split-K decoding (Flash Decoding)
- W4A16 inference for sm_75
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
7 changes: 1 addition & 6 deletions README_zh-CN.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,7 @@ ______________________________________________________________________

## Updates 🎉

- \[2023/11\] Major TurboMind upgrade:
- Paged Attention
- Faster attention kernels, no longer limited by the maximum sequence length
- Faster KV8 kernels (more than 2x faster)
- Split-K decoding (Flash Decoding)
- W4A16 inference for the sm_75 architecture
- \[2023/11\] Major TurboMind upgrade, including: Paged Attention, faster attention kernels with no maximum sequence length limit, 2x+ faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for the sm_75 architecture
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports the InternLM-20B model
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat, and the Python specialist. Click [here](./docs/zh_cn/supported_models/codellama.md) for the deployment guide
104 changes: 67 additions & 37 deletions docs/en/turbomind_config.md
@@ -4,9 +4,9 @@ TurboMind is one of the inference engines of LMDeploy. When using it to do model

If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section for the relevant configuration details. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration.

## TurboMind 1.0 config
## TurboMind 2.0 config

Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:
Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its `config.ini` content is as follows:

```toml
[llama]
@@ -27,15 +27,16 @@ rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_batch_size = 64
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
rope_scaling_factor = 0.0
use_logn_attn = 0
```

@@ -57,33 +58,45 @@ rope_theta = 10000.0
size_per_head = 128
```

Compared with TurboMind 1.0, the model attributes in the config remain the same, while the inference parameters have changed.
In the following sections, we will focus on introducing the inference parameters.

### data type

`weight_type` and `group_size` are the relevant parameters, which cannot be modified.
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.

`weight_type` represents the data type of weights. Currently, `fp16` and `int4` are supported. `int4` represents 4bit weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing weights with `awq`. At present, turbomind only supports `group_size = 128`.
`weight_type` represents the data type of weights. Currently, `fp16` and `int4` are supported. `int4` represents 4bit weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels built for `group_size = 128`.
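
For intuition, the following sketch shows the bookkeeping implied by `group_size = 128` in the common AWQ-style 4-bit layout, where each group of 128 consecutive weights along the input dimension shares one scale and zero point. The layer shape is an illustrative example, not something read from TurboMind.

```python
# Bookkeeping implied by group-wise 4-bit quantization with group_size = 128.
# Follows the common AWQ-style description; not TurboMind's internal layout.
group_size = 128
in_features, out_features = 4096, 11008  # e.g. one llama-2-7b MLP projection

groups_per_column = in_features // group_size  # 32 groups along the input dim
num_scales = groups_per_column * out_features  # one scale (and zero point) per group
print(groups_per_column, num_scales)           # 32, 352256
```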

### batch size

`max_batch_size` determines the max size of a batch during inference. In general, the larger the batch size is, the higher the throughput is. But make sure that `max_batch_size <= cache_max_entry_count`
The maximum batch size is still set through `max_batch_size`. But its default value has been changed from 32 to 64, and `max_batch_size` is no longer related to `cache_max_entry_count`.

### k/v cache size

TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.
k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.

- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the size of k/v sequences to be allocated when new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.

`cache_block_seq_len` represents the length of the token sequence in a k/v block with a default value 128. TurboMind calculates the memory size of the k/v block according to the following formula:

```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```

For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/v block is: `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`

The meaning of `cache_max_entry_count` varies depending on its value:

- When it's a decimal between (0, 1), `cache_max_entry_count` represents the fraction of GPU memory used by k/v blocks. For example, if turbomind launches on an A100-80G GPU with `cache_max_entry_count` being `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40G`.
- When it's an integer no less than 1, it represents the number of k/v blocks. Both cases are worked through in the sizing sketch below.
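
The sizing sketch below plugs the llama-2-7b numbers from the config above into this formula. The helper function and the 80 GB GPU figure are assumptions for illustration; they are not TurboMind APIs.

```python
# Rough k/v cache sizing for TurboMind 2.0 using the llama-2-7b values above.
# Helper names and the 80 GB GPU figure are illustrative assumptions only.
def kv_block_bytes(cache_block_seq_len, num_layer, kv_head_num, size_per_head,
                   elem_bytes=2):  # 2 bytes per element when k/v is stored as half
    # cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 (k and v) * sizeof(dtype)
    return cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * elem_bytes


block = kv_block_bytes(cache_block_seq_len=128, num_layer=32,
                       kv_head_num=32, size_per_head=128)
print(block // 2**20, "MB per block")          # 64 MB, matching the example above

# Case 1: cache_max_entry_count is a ratio of GPU memory
gpu_mem = 80 * 2**30                           # e.g. an A100-80G
ratio = 0.5
print(int(gpu_mem * ratio) // 2**30, "GB for k/v blocks")  # 40 GB
print(int(gpu_mem * ratio) // block, "blocks")             # 640 blocks

# Case 2: cache_max_entry_count is an integer block count
num_blocks = 48
print(num_blocks * block // 2**30, "GB for k/v blocks")    # 3 GB
```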

### kv int8 switch

When initiating 8bit k/v inference, it's necessary to modify the parameters `quant_policy` and `use_context_fmha`. Please refer to [kv int8](./kv_int8.md) for a guide.
When initiating 8bit k/v inference, set `quant_policy = 4`. Please refer to [kv int8](./kv_int8.md) for a guide.
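
As a minimal sketch of how to apply this switch, the snippet below rewrites `config.ini` with Python's standard `configparser`. The file path is an assumption; point it at the `config.ini` of your own converted model.

```python
# Minimal sketch: flip the kv int8 switch in an existing config.ini.
# The path below is only an example of a typical converted-model layout.
import configparser

config_path = "./workspace/triton_models/weights/config.ini"  # assumed location

parser = configparser.ConfigParser()
parser.read(config_path)
parser["llama"]["quant_policy"] = "4"  # enable 8-bit k/v inference
with open(config_path, "w") as f:
    parser.write(f)
```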

### long context switch

By setting `use_dynamic_ntk = 1`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.
By setting `rope_scaling_factor = 1.0`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.

Regarding the principle of Dynamic NTK, please refer to:

@@ -92,9 +105,9 @@

You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
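
For intuition only, here is one common formulation of Dynamic NTK scaling, as described in the references cited in this guide: once the sequence grows past the trained context length, the RoPE base is enlarged. This is a sketch of the principle, not TurboMind's kernel code; the default values mirror the llama-2-7b config above.

```python
# One common Dynamic NTK formulation (see the references cited in this guide):
# enlarge the RoPE base when the sequence exceeds the trained context length.
# Sketch of the principle only; it does not reproduce TurboMind's kernels.
def dynamic_ntk_base(seq_len, base=10000.0, max_position_embeddings=2048,
                     head_dim=128, scaling_factor=1.0):
    if seq_len <= max_position_embeddings:
        return base
    scale = scaling_factor * seq_len / max_position_embeddings - (scaling_factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))


print(dynamic_ntk_base(2048))  # 10000.0 — within the trained context, no change
print(dynamic_ntk_base(8192))  # the base grows, stretching RoPE for longer inputs
```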

## TurboMind 2.0 config
## TurboMind 1.0 config

In TurboMind 2.0, the model attributes in the config remain the same as in TurboMind 1.0, while the inference parameters have changed. We still take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its `config.ini` content is as follows:
Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:

```toml
[llama]
@@ -115,11 +128,10 @@ rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
@@ -128,37 +140,55 @@ use_dynamic_ntk = 0
use_logn_attn = 0
```

### data type

The same as in TurboMind 1.0
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.

### batch size
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```

The maximum batch size is still set through `max_batch_size`. But its default value has been changed from 32 to 64, and `max_batch_size` is no longer related to `cache_max_entry_count`.
In the following sections, we will focus on introducing the inference parameters.

### k/v cache size
### data type

k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.

TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
`weight_type` represents the data type of weights. Currently, `fp16` and `int4` are supported. `int4` represents 4bit weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing weights with `awq`. The LMDeploy prebuilt package includes kernels built for `group_size = 128`.

`cache_block_seq_len` represents the length of the token sequence in a k/v block with a default value 128. TurboMind calculates the memory size of the k/v block according to the following formula:
### batch size

```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```
`max_batch_size` determines the max size of a batch during inference. In general, the larger the batch size is, the higher the throughput is. But make sure that `max_batch_size <= cache_max_entry_count`

For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/v block is: `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`
### k/v cache size

The meaning of `cache_max_entry_count` varies depending on its value:
TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.

- When it's a decimal between (0, 1), `cache_max_entry_count` represents the percentage of memory used by k/v blocks. For example, if turbomind launches on a A100-80G GPU with `cache_max_entry_count` being `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40G`.
- When it's an integer no less than 1, it represents the number of k/v blocks
- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the size of k/v sequences to be allocated when new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
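
As a rough illustration of how these three values interact, the sketch below estimates the 1.0-style cache footprint, assuming each cached sequence reserves k/v storage for the full `session_len`. This is a simplifying assumption for sizing intuition, not a description of TurboMind's allocator, and the `session_len` value is only an example.

```python
# Rough TurboMind 1.0 k/v sizing intuition, assuming each cached sequence
# reserves storage for the full session_len. Illustration only.
def kv_per_sequence_bytes(session_len, num_layer, kv_head_num, size_per_head,
                          elem_bytes=2):  # half-precision k/v
    return session_len * num_layer * kv_head_num * size_per_head * 2 * elem_bytes


per_seq = kv_per_sequence_bytes(session_len=2048, num_layer=32,
                                kv_head_num=32, size_per_head=128)
print(per_seq // 2**30, "GB per cached sequence")  # 1 GB at session_len=2048

cache_max_entry_count = 48                         # value from the config above
print(cache_max_entry_count * per_seq // 2**30, "GB upper bound for the k/v cache")
# cache_chunk_size controls how many of these entries are allocated at once
# when new sequences arrive.
```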

### kv int8 switch

The same as in TurboMind 1.0
When initiating 8bit k/v inference, change `quant_policy = 4` and `use_context_fmha = 0`. Please refer to [kv int8](./kv_int8.md) for a guide.

### long context switch

The same as in TurboMind 1.0
By setting `use_dynamic_ntk = 1`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.

Regarding the principle of Dynamic NTK, please refer to:

1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675

You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.