Description
@minhthuc2502 @alexlnkp
What type of KV cache does CTranslate2 currently implement: is it static or dynamic? Could we achieve a speed-up by changing the cache implementation for the decoder in encoder-decoder models?
Also, it would be great to support recent popular quantization techniques such as [HQQ](https://github.com/mobiusml/hqq) in the CTranslate2 format.
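For context, here is a minimal sketch of the kind of grouped low-bit weight quantization such techniques build on. This is a generic asymmetric 4-bit scheme in plain Python, not HQQ's actual algorithm (HQQ additionally optimizes the scale and zero-point with a half-quadratic solver to minimize reconstruction error); all names here are illustrative.

```python
def quantize_4bit(weights, group_size=4):
    """Quantize a flat list of floats to 4-bit levels (0..15) per group,
    storing one (scale, zero) pair per group (asymmetric scheme)."""
    quantized, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 16 levels; guard against zero range
        quantized.append([round((w - lo) / scale) for w in group])
        params.append((scale, lo))
    return quantized, params

def dequantize_4bit(quantized, params):
    """Reconstruct approximate float weights from the quantized groups."""
    out = []
    for group, (scale, zero) in zip(quantized, params):
        out.extend(q * scale + zero for q in group)
    return out

weights = [0.12, -0.5, 0.33, 0.9, -1.2, 0.05, 0.7, -0.3]
q, p = quantize_4bit(weights)
recon = dequantize_4bit(q, p)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

The round-trip error is bounded by half the group's scale, which is why smaller groups (and, in HQQ, optimized scale/zero) reduce quantization error.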
Motivation
Given that a static cache (see this PR) can significantly speed up PyTorch encoder-decoder models via torch.compile, could CTranslate2 support this as well? The enhancement would improve decoding speed for projects built on CTranslate2 models, such as Faster Whisper.
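For background, a toy sketch of the difference (plain Python, not CTranslate2's actual internals): a dynamic cache grows each decoding step, so tensor shapes change and a compiler must re-trace or fall back to dynamic shapes, while a static cache writes in place into a buffer preallocated at a fixed maximum length, so shapes stay constant and the decode step can be compiled once and reused.

```python
class DynamicCache:
    """Grows on every step: the cache's size changes, analogous to
    torch.cat re-allocating a larger tensor each decoding step."""
    def __init__(self):
        self.keys = []                 # length == steps taken so far

    def update(self, k):
        self.keys = self.keys + [k]    # re-allocation on every step
        return self.keys

class StaticCache:
    """Pre-allocated to max_len: the buffer's size is fixed from step 0,
    so a compiler sees constant shapes and can specialize once."""
    def __init__(self, max_len):
        self.keys = [0.0] * max_len    # fixed-size buffer
        self.pos = 0                   # next write position

    def update(self, k):
        self.keys[self.pos] = k        # in-place write, no re-allocation
        self.pos += 1
        return self.keys[:self.pos]    # valid prefix seen by attention

dyn, sta = DynamicCache(), StaticCache(max_len=8)
for k in [0.1, 0.2, 0.3]:
    assert dyn.update(k) == sta.update(k)  # same visible contents
assert len(sta.keys) == 8                  # buffer size never changes
```

Both caches expose the same contents to attention; the win comes purely from the static version's fixed allocation, which is what makes it friendly to ahead-of-time specialization.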
References
Speed-up achieved for PyTorch-based Whisper: Blog Post
Benefits
Implementing static caching and recent quantization techniques in CTranslate2 could significantly improve decoding speed and memory efficiency.
Thank you for considering this feature request!