Add basic FP8 KV cache support (#2603)
* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However, support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml
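To make the precision trade-off of the `fp8_e5m2` type concrete, here is a minimal pure-Python sketch of E5M2 rounding. It is not code from this commit: the helper names `to_e5m2_bits`, `from_e5m2_bits`, and `roundtrip` are hypothetical, and the sketch only relies on the fact that E5M2 (1 sign bit, 5 exponent bits, 2 mantissa bits) has the same bit layout as the upper byte of an IEEE half-precision value. No scaling is applied, which matches the note above that scales are not yet loaded.

```python
import struct


def to_e5m2_bits(x: float) -> int:
    """Round x to FP8 E5M2 and return the 8-bit pattern (hypothetical helper).

    struct.pack("<e", ...) raises OverflowError if x is outside the float16 range.
    """
    (half,) = struct.unpack("<H", struct.pack("<e", x))
    upper, lower = half >> 8, half & 0xFF
    # Leave inf/NaN (all exponent bits set) untouched apart from truncation,
    # keeping NaN a NaN even if its payload lived in the dropped bits.
    if (upper & 0x7C) == 0x7C:
        if (half & 0x03FF) and not (upper & 0x03):
            upper |= 0x02
        return upper
    # Round to nearest, ties to even, on the 8 dropped mantissa bits.
    if lower > 0x80 or (lower == 0x80 and (upper & 1)):
        upper += 1  # may carry into the exponent, or overflow to inf
    return upper


def from_e5m2_bits(bits: int) -> float:
    """Decode an E5M2 bit pattern by widening it to float16."""
    (value,) = struct.unpack("<e", struct.pack("<H", (bits & 0xFF) << 8))
    return value


def roundtrip(x: float) -> float:
    """Value that x would hold after being stored as E5M2."""
    return from_e5m2_bits(to_e5m2_bits(x))


if __name__ == "__main__":
    for v in (1.0, 1.1, 1.2, -0.5, 1.375):
        print(v, "->", roundtrip(v))
```

With only two mantissa bits, values between 1 and 2 can land only on 1.0, 1.25, 1.5, and 1.75, so each cached key and value element takes one byte instead of the two used by `float16`/`bfloat16`, at the cost of coarser rounding.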
Showing 33 changed files with 1,015 additions and 192 deletions.
104 changes: 104 additions & 0 deletions
...sts/models/__snapshots__/test_flash_llama_fp8_kv_cache/test_flash_llama_fp8_kv_cache.json
@@ -0,0 +1,104 @@
{
  "details": {
    "best_of_sequences": null,
    "finish_reason": "length",
    "generated_tokens": 10,
    "prefill": [
      {
        "id": 128000,
        "logprob": null,
        "text": "<|begin_of_text|>"
      },
      {
        "id": 3923,
        "logprob": -5.6328125,
        "text": "What"
      },
      {
        "id": 374,
        "logprob": -1.2265625,
        "text": " is"
      },
      {
        "id": 5655,
        "logprob": -9.1015625,
        "text": " deep"
      },
      {
        "id": 6975,
        "logprob": -1.8085938,
        "text": " learning"
      },
      {
        "id": 30,
        "logprob": -1.0439453,
        "text": "?"
      }
    ],
    "seed": null,
    "tokens": [
      {
        "id": 18682,
        "logprob": -2.1992188,
        "special": false,
        "text": " Deep"
      },
      {
        "id": 6975,
        "logprob": -0.079956055,
        "special": false,
        "text": " learning"
      },
      {
        "id": 374,
        "logprob": -0.2763672,
        "special": false,
        "text": " is"
      },
      {
        "id": 264,
        "logprob": -0.37548828,
        "special": false,
        "text": " a"
      },
      {
        "id": 27084,
        "logprob": -1.4628906,
        "special": false,
        "text": " subset"
      },
      {
        "id": 315,
        "logprob": -0.02885437,
        "special": false,
        "text": " of"
      },
      {
        "id": 5780,
        "logprob": -0.2565918,
        "special": false,
        "text": " machine"
      },
      {
        "id": 6975,
        "logprob": -0.0063438416,
        "special": false,
        "text": " learning"
      },
      {
        "id": 430,
        "logprob": -1.3056641,
        "special": false,
        "text": " that"
      },
      {
        "id": 374,
        "logprob": -1.6035156,
        "special": false,
        "text": " is"
      }
    ],
    "top_tokens": null
  },
  "generated_text": " Deep learning is a subset of machine learning that is"
}
57 changes: 57 additions & 0 deletions
...__snapshots__/test_flash_llama_fp8_kv_cache/test_flash_llama_fp8_kv_cache_all_params.json
@@ -0,0 +1,57 @@
{
  "details": {
    "best_of_sequences": null,
    "finish_reason": "eos_token",
    "generated_tokens": 3,
    "prefill": [
      {
        "id": 128000,
        "logprob": null,
        "text": "<|begin_of_text|>"
      },
      {
        "id": 374,
        "logprob": -22.96875,
        "text": " is"
      },
      {
        "id": 5655,
        "logprob": -10.71875,
        "text": " deep"
      },
      {
        "id": 6975,
        "logprob": -2.6992188,
        "text": " learning"
      },
      {
        "id": 30,
        "logprob": -4.8398438,
        "text": "?"
      }
    ],
    "seed": 0,
    "tokens": [
      {
        "id": 720,
        "logprob": -0.4411621,
        "special": false,
        "text": " \n"
      },
      {
        "id": 220,
        "logprob": -0.35864258,
        "special": false,
        "text": " "
      },
      {
        "id": 128001,
        "logprob": 0.0,
        "special": true,
        "text": "<|end_of_text|>"
      }
    ],
    "top_tokens": null
  },
  "generated_text": "What is deep learning? \n "
}