Detailed description of the requested feature
Support for quantization and deployment of Qwen3-TTS-style models within the NVIDIA optimization stack, ideally including compatibility with TensorRT-LLM or a clearly defined alternative pipeline.
Specifically, the request is for:
- Ability to quantize non-Transformer / non-text-generation models (e.g., TTS pipelines) using a unified workflow similar to LLMs
- Support for multi-component models, including:
  - text encoder (Transformer-based)
  - acoustic model (autoregressive / diffusion / codec-based)
  - vocoder (CNN-based waveform generator)
- End-to-end export pipeline: PyTorch → Quantization → ONNX → TensorRT engine(s)
- Guidance or tooling for:
  - handling models not implemented in Hugging Face Transformers
  - exporting models with custom forward passes or generation loops
- Optional: partial support for prefill/decode-style optimization where applicable (e.g., transformer submodules)
This would enable efficient deployment of modern TTS systems on NVIDIA GPUs with reduced latency and memory usage.
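For concreteness, the per-component quantization step being requested could look like the sketch below. The module classes and shapes are toy stand-ins (not the real Qwen3-TTS components), and `torch.ao.quantization.quantize_dynamic` stands in for whatever unified ModelOpt-style API would be provided:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Toy stand-in for the Transformer-based text encoder."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 128)
        self.fc2 = nn.Linear(128, 64)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class Vocoder(nn.Module):
    """Toy stand-in for the CNN-based waveform generator."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(80, 1, kernel_size=3, padding=1)
    def forward(self, mel):
        return self.conv(mel)

encoder, vocoder = TextEncoder().eval(), Vocoder().eval()

# Quantize each component independently: INT8 dynamic quantization for the
# Linear-heavy encoder. The CNN vocoder stays FP32 here; in the requested
# pipeline it would instead be exported to ONNX and quantized while building
# the TensorRT engine (e.g. trtexec --int8).
q_encoder = torch.ao.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)

text_feats = q_encoder(torch.randn(1, 17, 64))   # (batch, tokens, dim)
audio = vocoder(torch.randn(1, 80, 128))         # (batch, mel_bins, frames)
```

The point of the sketch is the per-component split: each submodule gets the quantization scheme that suits it, then each is exported as its own ONNX graph and TensorRT engine.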
Describe alternatives you've considered
- torch AO library
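  torch AO does cover post-training static quantization with a calibration loop, e.g. via its FX graph-mode API, but the resulting INT8 model targets CPU backends (fbgemm/x86) rather than producing TensorRT engines. A minimal sketch of that workflow on a hypothetical vocoder-like Conv1d stack:

  ```python
  import torch
  import torch.nn as nn
  from torch.ao.quantization import get_default_qconfig_mapping
  from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

  # Toy vocoder-like model; shapes are illustrative only.
  model = nn.Sequential(
      nn.Conv1d(80, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.Conv1d(32, 1, kernel_size=3, padding=1),
  ).eval()

  example = torch.randn(1, 80, 128)  # (batch, mel_bins, frames)
  prepared = prepare_fx(model, get_default_qconfig_mapping("fbgemm"), (example,))

  # Calibration: run a few representative batches so observers record ranges.
  with torch.no_grad():
      for _ in range(4):
          prepared(torch.randn(1, 80, 128))

  quantized = convert_fx(prepared)  # INT8 model with CPU kernels, not a TRT engine
  audio = quantized(example)
  ```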
Target hardware/use case
- NVIDIA GPUs (e.g., A5000)