rename repo id + change readme
ylacombe committed Sep 18, 2024
1 parent 502865f commit 5ad87a8
Showing 4 changed files with 12 additions and 14 deletions.
14 changes: 6 additions & 8 deletions docs/source/en/model_doc/mimi.md
@@ -18,22 +18,20 @@ rendered properly in your Markdown viewer.

## Overview

-The Mimi model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
+The Mimi model was proposed in [Moshi: a speech-text foundation model for real-time dialogue](https://kyutai.org/Moshi.pdf) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.

The abstract from the paper is the following:

-*<INSERT PAPER ABSTRACT HERE>*
+*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning, such as emotion or non-speech sounds, is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner Monologue” method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.*

-Mimi is a high-fidelity audio codec model developed by the Kyutai team. It can be used to project audio waveforms into quantized latent spaces, and vice versa. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
+Mimi is a high-fidelity audio codec model developed by the Kyutai team that combines semantic and acoustic information into audio tokens running at 12 Hz and a bitrate of 1.1 kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.


Its architecture is based on [Encodec](model_doc/encodec) with several major differences:
* it uses a much lower frame-rate.
* it uses additional transformers for encoding and decoding for better latent contextualization.
* it uses a different quantization scheme: one codebook is dedicated to semantic projection.
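
These differences are reflected directly in the model configuration. A minimal sketch of how to inspect them, assuming the `MimiConfig` attribute names `frame_rate`, `num_quantizers` and `num_semantic_quantizers` (see the `MimiConfig` reference below):

```python
>>> from transformers import MimiConfig

>>> config = MimiConfig()
>>> # number of audio tokens produced per second of audio (the low frame rate noted above)
>>> config.frame_rate
>>> # total number of codebooks, and how many of them carry the semantic projection
>>> config.num_quantizers, config.num_semantic_quantizers
```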



## Usage example

Here is a quick example of how to encode and decode an audio sample using this model:
@@ -44,8 +42,8 @@ Here is a quick example of how to encode and decode an audio using this model:
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> # load model and feature extractor
>>> model = MimiModel.from_pretrained("kmhf/mimi") # TODO(YL): modify once official
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kmhf/mimi")
>>> model = MimiModel.from_pretrained("kyutai/mimi")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

>>> # load audio sample
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
@@ -59,7 +57,7 @@ Here is a quick example of how to encode and decode an audio using this model:
```
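
The hunk above shows only the changed lines of the example. For context, a minimal end-to-end sketch of the same flow, assuming the `encode`/`decode` API that `MimiModel` shares with Encodec:

```python
>>> from datasets import Audio, load_dataset
>>> from transformers import AutoFeatureExtractor, MimiModel

>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> model = MimiModel.from_pretrained("kyutai/mimi")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

>>> # resample the audio to the rate the feature extractor expects
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio_sample = librispeech_dummy[0]["audio"]["array"]
>>> inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

>>> # encode into discrete audio tokens, then decode back to a waveform
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]

>>> # or run both steps in a single forward pass
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```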

This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe).
-The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).
+The original code can be found [here](https://github.com/kyutai-labs/moshi).


## MimiConfig
6 changes: 3 additions & 3 deletions src/transformers/models/mimi/configuration_mimi.py
@@ -30,7 +30,7 @@ class MimiConfig(PretrainedConfig):
This is the configuration class to store the configuration of an [`MimiModel`]. It is used to instantiate a
Mimi model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the
-[kmhf/mimi](https://huggingface.co/kmhf/mimi) architecture.
+[kyutai/mimi](https://huggingface.co/kyutai/mimi) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
@@ -126,10 +126,10 @@ class MimiConfig(PretrainedConfig):
```python
>>> from transformers import MimiModel, MimiConfig
>>> # Initializing a "kmhf/mimi" style configuration
>>> # Initializing a "kyutai/mimi" style configuration
>>> configuration = MimiConfig()
->>> # Initializing a model (with random weights) from the "kmhf/mimi" style configuration
+>>> # Initializing a model (with random weights) from the "kyutai/mimi" style configuration
>>> model = MimiModel(configuration)
>>> # Accessing the model configuration
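>>> # Hedged completion of the collapsed tail of this example: the standard
>>> # docstring pattern reads the configuration back from the model
>>> configuration = model.config
```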
2 changes: 1 addition & 1 deletion src/transformers/models/mimi/modeling_mimi.py
@@ -1679,7 +1679,7 @@ def forward(
>>> dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
>>> audio_sample = dataset["train"]["audio"][0]["array"]
>>> model_id = "kmhf/mimi"
>>> model_id = "kyutai/mimi"
>>> model = MimiModel.from_pretrained(model_id)
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
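>>> # Hedged sketch of how the collapsed docstring example presumably continues:
>>> # extract features, then run the codec end to end
>>> inputs = feature_extractor(raw_audio=audio_sample, return_tensors="pt")
>>> outputs = model(**inputs)
>>> audio_codes = outputs.audio_codes  # discrete tokens, one stream per codebook
>>> audio_values = outputs.audio_values  # reconstructed waveform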
4 changes: 2 additions & 2 deletions tests/models/mimi/test_modeling_mimi.py
@@ -787,7 +787,7 @@ def test_integration_using_cache_decode(self):
}

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
model_id = "kmhf/mimi" # TODO(YL): modify once official
model_id = "kyutai/mimi"

model = MimiModel.from_pretrained(model_id, use_cache=True).to(torch_device)
processor = AutoFeatureExtractor.from_pretrained(model_id)
@@ -837,7 +837,7 @@ def test_integration(self):
"32": 1803071,
}
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
model_id = "kmhf/mimi" # TODO(YL): modify once official
model_id = "kyutai/mimi"

processor = AutoFeatureExtractor.from_pretrained(model_id)

