
Feature Request: Implement Static Cache and Quantization Techniques in CTranslate2  #1717

@Jiltseb

Description


@minhthuc2502 @alexlnkp

What type of KV cache does CTranslate2 currently implement for the decoder: static (pre-allocated to a fixed maximum length) or dynamic (grown step by step)? If it is dynamic, could switching the decoder of encoder-decoder models to a static cache yield a speed-up?

Also, it would be great to implement recent popular quantization techniques such as [HQQ](https://github.com/mobiusml/hqq) in the CTranslate2 format.
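For context, a minimal sketch of the asymmetric low-bit weight quantization that methods like HQQ build on. This is not HQQ itself (HQQ additionally optimizes the zero-point and scale per group via half-quadratic splitting); it only illustrates the basic quantize/dequantize round-trip that any such scheme performs:

```python
def quantize(w, bits=4):
    """Map floats in [min(w), max(w)] to integers in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = lo
    q = [round((x - zero) / scale) for x in w]
    return q, scale, zero

def dequantize(q, scale, zero):
    """Recover approximate float weights from the integer codes."""
    return [x * scale + zero for x in q]

w = [-0.5, 0.0, 0.25, 1.0]
q, scale, zero = quantize(w, bits=4)
w_hat = dequantize(q, scale, zero)
# Reconstruction error is bounded by half a quantization step (scale / 2).
```

HQQ's contribution is to choose `zero` and `scale` so that the reconstruction error under outliers is minimized, rather than taking the min/max as done here.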

Motivation
Given that a static cache (see this PR) can significantly speed up decoding in PyTorch encoder-decoder models when combined with `torch.compile` (fixed tensor shapes let the compiler specialize kernels once instead of recompiling as the cache grows), can we enable this in CTranslate2? This enhancement could improve decoding speed for projects built on CTranslate2 models, such as Faster Whisper.
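To make the request concrete, here is an illustrative sketch (not CTranslate2 internals) of the difference between the two cache styles during autoregressive decoding:

```python
class DynamicCache:
    """Grows by appending each step; shapes change every iteration,
    which defeats ahead-of-time compilation of fixed-shape kernels."""
    def __init__(self):
        self.keys = []  # one entry per decoded position

    def update(self, key):
        self.keys.append(key)  # reallocation/append per step
        return self.keys       # length grows: 1, 2, 3, ...

class StaticCache:
    """Pre-allocated to max_len; shapes are constant, so a compiler
    (e.g. torch.compile / CUDA graphs) can specialize kernels once."""
    def __init__(self, max_len):
        self.keys = [None] * max_len  # fixed-size buffer
        self.pos = 0                  # next write position

    def update(self, key):
        self.keys[self.pos] = key  # in-place write, no reallocation
        self.pos += 1
        return self.keys           # length is always max_len

dyn, stat = DynamicCache(), StaticCache(max_len=4)
for t in range(3):
    dyn.update(f"k{t}")
    stat.update(f"k{t}")

print(len(dyn.keys))   # grows with the number of steps: 3
print(len(stat.keys))  # constant regardless of steps: 4
```

The static variant trades a bounded maximum sequence length for shape stability, which is what enables the compilation speed-ups reported for PyTorch Whisper.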

References
Speed-up achieved for PyTorch-based Whisper: Blog Post

Benefits
Implementing static caching and recent quantization techniques in CTranslate2 could significantly improve decoding speed and memory efficiency across downstream projects.

Thank you for considering this feature request!
