Description
@minhthuc2502 @alexlnkp
What type of KV cache does CTranslate2 currently implement: is it static or dynamic? Could we achieve a speed-up by changing the cache implementation for the decoder in encoder-decoder models?
Also, it would be great to support recent popular quantization techniques such as [HQQ](https://github.com/mobiusml/hqq) in the CTranslate2 format.
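For context, here is a minimal sketch of the kind of grouped low-bit weight quantization such techniques build on. This is a generic asymmetric 4-bit scheme in plain Python, not HQQ's actual algorithm (HQQ additionally optimizes the scale and zero-point with a half-quadratic solver to minimize reconstruction error); all names here are illustrative.

```python
def quantize_4bit(weights, group_size=4):
    """Quantize a flat list of floats to 4-bit levels (0..15) per group,
    storing one (scale, zero) pair per group (asymmetric scheme)."""
    quantized, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 16 levels; guard against zero range
        quantized.append([round((w - lo) / scale) for w in group])
        params.append((scale, lo))
    return quantized, params

def dequantize_4bit(quantized, params):
    """Reconstruct approximate float weights from the quantized groups."""
    out = []
    for group, (scale, zero) in zip(quantized, params):
        out.extend(q * scale + zero for q in group)
    return out

weights = [0.12, -0.5, 0.33, 0.9, -1.2, 0.05, 0.7, -0.3]
q, p = quantize_4bit(weights)
recon = dequantize_4bit(q, p)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

The round-trip error is bounded by half the group's scale, which is why smaller groups (and, in HQQ, optimized scale/zero) reduce quantization error.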
Motivation
Given that a static cache (see this PR) can significantly speed up PyTorch encoder-decoder models via torch.compile, could CTranslate2 support this as well? The enhancement would improve decoding speed for projects built on CTranslate2 models, such as Faster Whisper.
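For background, a toy sketch of the difference (plain Python, not CTranslate2's actual internals): a dynamic cache grows each decoding step, so tensor shapes change and a compiler must re-trace or fall back to dynamic shapes, while a static cache writes in place into a buffer preallocated at a fixed maximum length, so shapes stay constant and the decode step can be compiled once and reused.

```python
class DynamicCache:
    """Grows on every step: the cache's size changes, analogous to
    torch.cat re-allocating a larger tensor each decoding step."""
    def __init__(self):
        self.keys = []                 # length == steps taken so far

    def update(self, k):
        self.keys = self.keys + [k]    # re-allocation on every step
        return self.keys

class StaticCache:
    """Pre-allocated to max_len: the buffer's size is fixed from step 0,
    so a compiler sees constant shapes and can specialize once."""
    def __init__(self, max_len):
        self.keys = [0.0] * max_len    # fixed-size buffer
        self.pos = 0                   # next write position

    def update(self, k):
        self.keys[self.pos] = k        # in-place write, no re-allocation
        self.pos += 1
        return self.keys[:self.pos]    # valid prefix seen by attention

dyn, sta = DynamicCache(), StaticCache(max_len=8)
for k in [0.1, 0.2, 0.3]:
    assert dyn.update(k) == sta.update(k)  # same visible contents
assert len(sta.keys) == 8                  # buffer size never changes
```

Both caches expose the same contents to attention; the win comes purely from the static version's fixed allocation, which is what makes it friendly to ahead-of-time specialization.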
References
Speed-up achieved for PyTorch-based Whisper: Blog Post
Benefits
Implementing static caching and recent quantization techniques in CTranslate2 could significantly improve decoding speed and memory efficiency.
Thank you for considering this feature request!