[WIP] Moshi integration #33624
base: main
Conversation
"for speech-to-speech.", | ||
MOSHI_START_DOCSTRING, | ||
) | ||
class MoshiForConditionalGeneration(MoshiPreTrainedModel): |
cc @gante, this is the model that'll use a weird generation!
This is roughly how I envision generation. It works already, but there'll be some changes that will make the code a bit heavier.
).audio_values

return output_text_ids, output_values
I actually would like to allow dynamic outputs depending on the type of generation (beam, sample, etc.). Do you think I can do a nested ModelOutput?
My suggestion would be to make it as close as possible to the return structure from the original generate. Users transitioning from other models to moshi would then have as little friction as possible 🤗
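For illustration, a minimal sketch of what a nested output could look like, assuming the usual transformers ModelOutput dataclass pattern; the class and field names here are hypothetical, not part of this PR:

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput


@dataclass
class MoshiConditionalGenerationOutput(ModelOutput):
    """Hypothetical output container; field names are illustrative, not the PR's."""

    sequences: Optional[torch.LongTensor] = None          # generated text token ids
    audio_sequences: Optional[torch.FloatTensor] = None   # decoded audio waveform values
    # a nested ModelOutput carrying whatever the inner (depth decoder) generate call returned
    depth_decoder_outputs: Optional[ModelOutput] = None
```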
the generation part makes sense to me!
suggestion: because it is quite convoluted with nested generate calls, adding a block diagram explaining the workflow and linking it to the docstring in def generate() will likely make life easier for us (long-term maintenance) and our users (they can quickly understand what's going on)
What does this PR do?
Moshi is the latest Kyutai model. It is a streaming speech-to-speech model that can also do an inner dialogue (i.e. it outputs text as well).
In particular, it means that Moshi deals with 3 streams of information: the user's audio, Moshi's audio, and Moshi's textual output.
Similarly to Musicgen, audio is represented with audio codebooks, which can be interpreted like tokens. The main difference between text tokens and audio codebooks is that audio codebooks introduce an additional dimension of information. Text tokens are typically of dim (batch_size, sequence_length) but audio tokens are of dim (batch_size, num_codebooks, sequence_length).
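A tiny shape sketch to make the extra codebook dimension concrete (the vocabulary sizes below are arbitrary placeholders):

```python
import torch

batch_size, num_codebooks, sequence_length = 2, 8, 125

# text tokens: one id per time step
text_tokens = torch.randint(0, 32_000, (batch_size, sequence_length))

# audio "tokens": one id per codebook per time step
audio_tokens = torch.randint(0, 2_048, (batch_size, num_codebooks, sequence_length))

print(text_tokens.shape)   # torch.Size([2, 125])
print(audio_tokens.shape)  # torch.Size([2, 8, 125])
```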
Moshi is made of 3 components:
1. The main decoder (Helium in the paper)
Here, it corresponds to MoshiForCausalLM. It is strictly a classic text LLM that uses an architecture similar to Gemma. In other words, it takes text tokens, embeds them, passes them through the decoder and a language head, to get text logits.
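As an illustration only, here is how one might use it like any other causal LM; the checkpoint name is a placeholder and the exact API is an assumption, since the class is introduced in this PR:

```python
import torch
from transformers import AutoTokenizer, MoshiForCausalLM

# placeholder checkpoint name, for illustration only
checkpoint = "kyutai/moshi-placeholder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MoshiForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Hello from the inner monologue!", return_tensors="pt")
with torch.no_grad():
    text_logits = model(**inputs).logits  # (batch_size, sequence_length, vocab_size)
```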
2. The depth decoder
On its own, it's also a classic LLM, but this time, instead of generating over the time dimension, it generates over the codebook dimension.
It also means that its context length is num_codebooks -> it can't generate more than num_codebooks tokens. Another interesting difference with a classic LLM is that each timestep (here, it corresponds to each codebook) gets its own set of linear layers and embeddings.
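A toy sketch of the per-codebook parameters idea (the module and its sizes are made up for illustration, not the PR's actual implementation):

```python
import torch
from torch import nn


class PerCodebookProjections(nn.Module):
    """Toy module: each codebook position gets its own embedding table and output head."""

    def __init__(self, num_codebooks=8, codebook_size=2048, hidden_size=1024):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(codebook_size, hidden_size) for _ in range(num_codebooks)
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size, bias=False) for _ in range(num_codebooks)
        )

    def forward(self, codebook_ids):
        # codebook_ids: (batch_size, num_codebooks); position k uses its own layers
        hidden = [emb(codebook_ids[:, k]) for k, emb in enumerate(self.embeddings)]
        logits = [head(h) for head, h in zip(self.heads, hidden)]
        return torch.stack(logits, dim=1)  # (batch_size, num_codebooks, codebook_size)
```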
3. Mimi
It's the audio encoder from Kyutai, which has recently been integrated into transformers and is used to "tokenize" audio. It has the same use that Encodec has in Musicgen.
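A rough usage sketch of Mimi as the audio tokenizer / de-tokenizer; the checkpoint name and exact call signatures are assumptions based on the existing Mimi integration:

```python
import torch
from transformers import MimiModel

# assumed checkpoint name, for illustration
model = MimiModel.from_pretrained("kyutai/mimi")

# one second of silent mono audio: (batch_size, channels, samples) at 24 kHz
waveform = torch.zeros(1, 1, 24_000)

# "tokenize" the audio into discrete codebooks, then decode back to a waveform
audio_codes = model.encode(waveform).audio_codes        # (batch_size, num_codebooks, frames)
audio_values = model.decode(audio_codes).audio_values   # (batch_size, channels, samples)
```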
Architecture choice:
- MoshiForCausalLM corresponds to the main decoder; it can be used as a textual LLM.
- MoshiDepthDecoder is the depth decoder mentioned above.
- MoshiForConditionalGeneration encapsulates the main decoder, the depth decoder and the audio encoder.

Conceptually, MoshiForConditionalGeneration takes as input one stream of text and two streams of audio inputs - what the user has said so far, and what the model has generated so far - and generates two streams: a text stream and an audio stream.

How does it work:
-> The input streams are embedded and combined into inputs_embeds.
-> inputs_embeds is passed through the main decoder. There's nothing special done here; it's the same operation as for Gemma and similar models.
-> The main decoder outputs text logits but also its last hidden state, which is called temporal context in the picture above.
-> The depth decoder switches the dimension on which we generate (codebooks instead of time). It uses the token generated from the text logits and the temporal context to auto-regressively generate audio codebooks.
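Putting the arrows above together, one generation step can be sketched as the following pseudocode; every helper name below is made up to mirror the description, not the PR's actual code:

```python
# Pseudocode sketch of one time step of generation; all names are hypothetical.
def generate_step(model, text_ids, user_audio_codes, moshi_audio_codes):
    # 1. embed the text stream and the two audio streams, then combine them into inputs_embeds
    inputs_embeds = (
        model.embed_text(text_ids)
        + model.embed_audio(user_audio_codes)
        + model.embed_audio(moshi_audio_codes)
    )

    # 2. run the main (temporal) decoder, exactly like a Gemma-style text LLM
    decoder_out = model.decoder(inputs_embeds=inputs_embeds)
    text_logits = model.lm_head(decoder_out.last_hidden_state)

    # 3. pick the next text token from the text logits
    next_text_token = text_logits[:, -1].argmax(dim=-1)

    # 4. the depth decoder then generates num_codebooks audio codes for this time step,
    #    auto-regressively over the codebook dimension, conditioned on the next text
    #    token and the temporal context (the main decoder's last hidden state)
    next_audio_codes = model.depth_decoder.generate(
        input_ids=next_text_token,
        temporal_context=decoder_out.last_hidden_state[:, -1],
    )
    return next_text_token, next_audio_codes
```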