Moshi integration #33624
Conversation
The generation part makes sense to me!
Suggestion: because it is quite convoluted with nested generate calls, adding a block diagram explaining the workflow and linking it to the docstring in `def generate()` will likely make life easier for us (long-term maintenance) and for our users (they can quickly understand what's going on).
Everything's green, I'll merge
* clean mimi commit
* some nits suggestions from Arthur
* make fixup
* first moshi WIP
* converting weights working + configuration + generation configuration
* finalize converting script - still missing tokenizer and FE and processor
* fix saving model w/o default config
* working generation
* use GenerationMixin instead of inheriting
* add delay pattern mask
* fix right order: moshi codes then user codes
* unconditional inputs + generation config
* get rid of MoshiGenerationConfig
* blank user inputs
* update convert script:fix conversion, add tokenizer, feature extractor and bf16
* add and correct Auto classes
* update modeling code, configuration and tests
* make fixup
* fix some copies
* WIP: add integration tests
* add dummy objects
* propose better readiblity and code organisation
* update tokenization tests
* update docstrigns, eval and modeling
* add .md
* make fixup
* add MoshiForConditionalGeneration to ignore Auto
* revert mimi changes
* re
* further fix
* Update moshi.md
* correct md formating
* move prepare causal mask to class
* fix copies
* fix depth decoder causal
* fix and correct some tests
* make style and update .md
* correct config checkpoitn
* Update tests/models/moshi/test_tokenization_moshi.py (Co-authored-by: Arthur <[email protected]>)
* Update tests/models/moshi/test_tokenization_moshi.py (Co-authored-by: Arthur <[email protected]>)
* make style
* Update src/transformers/models/moshi/__init__.py (Co-authored-by: Arthur <[email protected]>)
* fixup
* change firm in copyrights
* udpate config with nested dict
* replace einsum
* make style
* change split to True
* add back splt=False
* remove tests in convert
* Update tests/models/moshi/test_modeling_moshi.py (Co-authored-by: Arthur <[email protected]>)
* add default config repo + add model to FA2 docstrings
* remove logits float
* fix some tokenization tests and ignore some others
* make style tokenization tests
* update modeling with sliding window + update modeling tests
* [run-slow] moshi
* remove prepare for generation frol CausalLM
* isort
* remove copied from
* ignore offload tests
* update causal mask and prepare 4D mask aligned with recent changes
* further test refine + add back prepare_inputs_for_generation for depth decoder
* correct conditional use of prepare mask
* update slow integration tests
* fix multi-device forward
* remove previous solution to device_map
* save_load is flaky
* fix generate multi-devices
* fix device
* move tensor to int

---------

Co-authored-by: Arthur <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
What does this PR do?
Moshi is the latest Kyutai model. It is a streaming speech-to-speech model that can also carry an inner dialogue (i.e. it outputs text as well).
In particular, this means that Moshi deals with 3 streams of information: the user's audio, Moshi's audio, and Moshi's textual output.
Similarly to `Musicgen`, audio is represented with audio codebooks, which can be interpreted like tokens. The main difference between text tokens and audio codebooks is that audio codebooks introduce an additional dimension of information. Text tokens are typically of dim `(batch_size, sequence_length)`, but audio tokens are of dim `(batch_size, num_codebooks, sequence_length)`.
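For instance, a minimal shape sketch (plain PyTorch, with illustrative sizes that are not Moshi's actual configuration):

```python
import torch

batch_size, sequence_length = 2, 50
num_codebooks, codebook_size, vocab_size = 8, 2048, 32000  # illustrative values only

# Text tokens: one id per position.
text_tokens = torch.randint(vocab_size, (batch_size, sequence_length))

# Audio "tokens": one codebook id per codebook AND per position.
audio_tokens = torch.randint(codebook_size, (batch_size, num_codebooks, sequence_length))

print(text_tokens.shape)   # torch.Size([2, 50])
print(audio_tokens.shape)  # torch.Size([2, 8, 50])
```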
It's made of 3 components:
1. The main decoder (Helium in the paper)
Here, it corresponds to `MoshiForCausalLM`. It is strictly a classic text LLM that uses an architecture similar to `Gemma`. In other words, it takes text tokens, embeds them, and passes them through the decoder and a language head to get text logits.
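As a rough illustration of the "it is just a text LLM" point, `MoshiForCausalLM` can be driven like any other causal LM in transformers. This is only a sketch: the checkpoint id below is a placeholder, not necessarily a published one.

```python
import torch
from transformers import AutoTokenizer, MoshiForCausalLM

# Placeholder checkpoint id - swap in the Moshi checkpoint you actually use.
checkpoint = "kyutai/moshiko-pytorch-bf16"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MoshiForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Standard causal-LM loop: text tokens in, text logits/tokens out.
inputs = tokenizer("Hello, my name is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```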
2. The depth decoder
On its own, it's also a classic LLM, but this time, instead of generating over the time dimension, it generates over the codebook dimension.
This also means that its context length is `num_codebooks`: it can't generate more than `num_codebooks` tokens.
Another interesting difference from a classic LLM is that each timestep (here, each codebook) has its own set of linear layers and embeddings.
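A toy sketch of what "generating over the codebook dimension, with per-codebook embeddings and heads" means. This is not the actual `MoshiDepthDecoder` implementation (it uses a GRU stand-in instead of a transformer, and made-up sizes), just the shape of the idea:

```python
import torch
import torch.nn as nn

class ToyDepthDecoder(nn.Module):
    """Autoregressive over codebooks: step k embeds the previous token with its own
    embedding table and predicts codebook k with its own head."""
    def __init__(self, num_codebooks=8, codebook_size=2048, hidden=256):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(codebook_size, hidden) for _ in range(num_codebooks)])
        self.heads = nn.ModuleList([nn.Linear(hidden, codebook_size) for _ in range(num_codebooks)])
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the real decoder
        self.num_codebooks = num_codebooks

    @torch.no_grad()
    def generate(self, temporal_context):
        # temporal_context: (batch, hidden), i.e. the main decoder's last hidden state.
        hidden_state = temporal_context.unsqueeze(0).contiguous()
        token = torch.zeros(temporal_context.shape[0], dtype=torch.long)  # stand-in start token
        codes = []
        for k in range(self.num_codebooks):                        # context length == num_codebooks
            step_in = self.embeds[k](token).unsqueeze(1)           # (batch, 1, hidden)
            out, hidden_state = self.backbone(step_in, hidden_state)
            token = self.heads[k](out[:, -1]).argmax(-1)           # id for codebook k
            codes.append(token)
        return torch.stack(codes, dim=1)                           # (batch, num_codebooks)

print(ToyDepthDecoder().generate(torch.randn(2, 256)).shape)  # torch.Size([2, 8])
```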
3. Mimi
It's the audio encoder from Kyutai, which has recently been integrated into transformers and is used to "tokenize" audio. It plays the same role that `Encodec` plays in `Musicgen`.
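A hedged sketch of that "tokenize audio" role, assuming the Encodec-style encode/decode API that Mimi exposes in transformers (the checkpoint id is an assumption):

```python
import torch
from transformers import AutoFeatureExtractor, MimiModel

checkpoint = "kyutai/mimi"  # assumed checkpoint id - adjust to the one you actually use

feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = MimiModel.from_pretrained(checkpoint)

# One second of dummy mono audio at the expected sampling rate.
sr = feature_extractor.sampling_rate
inputs = feature_extractor(raw_audio=torch.zeros(sr).numpy(), sampling_rate=sr, return_tensors="pt")

# "Tokenize" the waveform into discrete codebooks, then reconstruct a waveform from them.
audio_codes = model.encode(inputs["input_values"]).audio_codes   # (batch, num_codebooks, frames)
audio_values = model.decode(audio_codes).audio_values
```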
Architecture choice:

* `MoshiForCausalLM` corresponds to the main decoder; it can be used as a textual LLM.
* `MoshiDepthDecoder` is the depth decoder mentioned above.
* `MoshiForConditionalGeneration` encapsulates the main decoder, the depth decoder and the audio encoder.

Conceptually, `MoshiForConditionalGeneration` takes as input one stream of text and two streams of audio - what the user has said so far, and what the model has generated so far - and generates two streams: a text stream and an audio stream.
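A rough usage sketch of that input/output contract. Treat it as an assumption-laden illustration: the checkpoint id is a placeholder, and the audio keyword argument names reflect my reading of the generate() signature rather than a guaranteed API.

```python
import torch
from transformers import AutoFeatureExtractor, AutoTokenizer, MoshiForConditionalGeneration

checkpoint = "kyutai/moshiko-pytorch-bf16"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = MoshiForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

sr = feature_extractor.sampling_rate
# One text stream + two audio streams (what the user said so far, what Moshi said so far).
text = tokenizer("Hello", return_tensors="pt")
user_audio = feature_extractor(raw_audio=torch.zeros(sr).numpy(), sampling_rate=sr, return_tensors="pt")
moshi_audio = feature_extractor(raw_audio=torch.zeros(sr).numpy(), sampling_rate=sr, return_tensors="pt")

# Keyword argument names below are assumptions about MoshiForConditionalGeneration.generate().
output = model.generate(
    input_ids=text["input_ids"],
    user_input_values=user_audio["input_values"],
    moshi_input_values=moshi_audio["input_values"],
    max_new_tokens=25,
)
# The output bundles the two generated streams: text token ids and audio (codes / waveform).
```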
How does it work:
-> The input streams are embedded and combined into `inputs_embeds`.
-> `inputs_embeds` is passed through the main decoder. There's nothing special done here; it's the same operation as in Gemma and the like.
-> The main decoder outputs `text logits`, but also its `last hidden state`, which is called the `temporal context` in the picture above.
-> The depth decoder switches the dimension on which we generate (codebooks instead of time). It uses the token generated from the `text logits` and the `temporal context` to auto-regressively generate audio codebooks.
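To make the nested-generation workflow concrete (echoing the block-diagram suggestion above), here is a deliberately simplified, self-contained sketch of one time step. Every component here is a hypothetical stand-in with made-up sizes, not the real transformers code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; dimensions are illustrative.
hidden, vocab, num_codebooks, codebook_size = 64, 100, 8, 50
embed_text = nn.Embedding(vocab, hidden)
embed_audio = nn.Embedding(codebook_size, hidden)
main_decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=1)
text_head = nn.Linear(hidden, vocab)
depth_head = nn.Linear(hidden, codebook_size)

def generate_step(text_tokens, user_codes, moshi_codes):
    """One Moshi time step: next text token + a stack of audio codebooks."""
    # 1) Embed the three streams and combine them into inputs_embeds.
    inputs_embeds = embed_text(text_tokens) + embed_audio(user_codes).sum(1) + embed_audio(moshi_codes).sum(1)

    # 2) Main decoder, same operation as a classic text LLM: text logits + temporal context.
    temporal_context = main_decoder(inputs_embeds)[:, -1]       # last hidden state
    next_text_token = text_head(temporal_context).argmax(-1)    # greedy pick from the text logits

    # 3) "Depth" generation over the codebook dimension, conditioned on the new
    #    text token and the temporal context (collapsed to a single head here for brevity).
    codes, state = [], temporal_context + embed_text(next_text_token)
    for _ in range(num_codebooks):
        code = depth_head(state).argmax(-1)
        codes.append(code)
        state = state + embed_audio(code)                       # condition on previous codebooks
    return next_text_token, torch.stack(codes, dim=1)

text = torch.randint(vocab, (1, 5))
user = torch.randint(codebook_size, (1, num_codebooks, 5))
moshi = torch.randint(codebook_size, (1, num_codebooks, 5))
tok, audio = generate_step(text, user, moshi)
print(tok.shape, audio.shape)  # torch.Size([1]) torch.Size([1, 8])
```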