update user guide according to review comments
lvhan028 committed Nov 19, 2023
1 parent 2c1a466 commit cbe7108
Showing 2 changed files with 13 additions and 1 deletion.
8 changes: 7 additions & 1 deletion docs/en/turbomind_config.md
@@ -88,7 +88,13 @@ For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/
The meaning of `cache_max_entry_count` varies depending on its value:

- When it's a decimal in the range (0, 1), `cache_max_entry_count` represents the fraction of GPU memory used by k/v blocks. For example, if turbomind launches on an A100-80G GPU with `cache_max_entry_count` set to `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40G`.
- When it's an integer no less than 1, it represents the number of k/v blocks
- When it's an integer > 0, it represents the total number of k/v blocks
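
The two interpretations above can be sketched as a small helper. This is only an illustration under stated assumptions, not turbomind's actual implementation: the function name `resolve_block_count`, its parameters, and the idea of dividing a memory budget by a per-block size are all hypothetical, and "80G" is treated as 80 GiB.

```python
def resolve_block_count(cache_max_entry_count, total_gpu_mem_bytes, block_bytes):
    """Hypothetical helper: number of k/v blocks implied by cache_max_entry_count.

    Assumption: a fractional value is applied to the total GPU memory,
    as in the A100-80G example, and the budget is divided by the size
    of one k/v block.
    """
    if 0 < cache_max_entry_count < 1:
        # Fraction of memory: convert the byte budget into whole blocks.
        budget = total_gpu_mem_bytes * cache_max_entry_count
        return int(budget // block_bytes)
    if cache_max_entry_count >= 1 and float(cache_max_entry_count).is_integer():
        # Positive integer: the value is already the total block count.
        return int(cache_max_entry_count)
    raise ValueError("cache_max_entry_count must be in (0, 1) or an integer > 0")
```

For instance, with an assumed block size of 64 MiB, `resolve_block_count(0.5, 80 * 2**30, 64 * 2**20)` gives a 40 GiB budget split into 640 blocks, while `resolve_block_count(1000, ...)` simply returns 1000.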

`cache_chunk_size` sets how many k/v cache blocks are allocated each time new k/v cache blocks are needed. Its meaning depends on its value:

- When it is an integer > 0, `cache_chunk_size` k/v cache blocks are allocated.
- When the value is -1, `cache_max_entry_count` k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated.
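
The three cases can be written as a short lookup. Again this is a sketch rather than the library's code: `resolve_chunk_blocks` is a hypothetical name, `max_block_count` is assumed to be `cache_max_entry_count` already expressed as a block count, and flooring the square root (with a minimum of one block) is an assumption, since the text does not say how the result is rounded.

```python
import math


def resolve_chunk_blocks(cache_chunk_size, max_block_count):
    """Hypothetical helper: blocks allocated per chunk for a given cache_chunk_size."""
    if cache_chunk_size > 0:
        # Explicit chunk size: allocate exactly that many blocks at a time.
        return cache_chunk_size
    if cache_chunk_size == -1:
        # Allocate the whole k/v cache in a single chunk.
        return max_block_count
    if cache_chunk_size == 0:
        # Square-root rule; floor rounding is an assumption of this sketch.
        return max(1, math.isqrt(max_block_count))
    raise ValueError("cache_chunk_size must be -1, 0, or an integer > 0")
```

With 640 blocks in total, a `cache_chunk_size` of 16 allocates 16 blocks per chunk, -1 allocates all 640 at once, and 0 allocates 25 per chunk under the floor-rounding assumption.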

### kv int8 switch

6 changes: 6 additions & 0 deletions docs/zh_cn/turbomind_config.md
@@ -92,6 +92,12 @@ cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_da
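- When the value is a decimal in the range (0, 1), `cache_max_entry_count` represents the fraction of memory used by k/v blocks. For example, an A100-80G GPU has 80G of memory; when `cache_max_entry_count` is 0.5, the total memory used by k/v blocks is 80 * 0.5 = 40G
- When the value is an integer > 1, it represents the number of k/v blocks

`cache_chunk_size` indicates the size of the k/v cache chunk allocated each time new k/v cache blocks are needed. Different values have different meanings:

- When it is an integer > 0, `cache_chunk_size` k/v cache blocks are allocated
- When the value is -1, `cache_max_entry_count` k/v cache blocks are allocated
- When the value is 0, `sqrt(cache_max_entry_count)` k/v cache blocks are allocated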

### kv int8 switch

`quant_policy` is the switch for KV-int8 inference. For usage details, see the [kv int8](./kv_int8.md) deployment document