feat(mlx_lm)!: batch_generate #948

Open · wants to merge 13 commits into main from feat/batch_generate

Conversation

@llllvvuu (Contributor) commented Aug 21, 2024

This is based on @willccbb's implementation at https://github.com/willccbb/mlx_parallm.

BREAKING CHANGE: `generate_step` takes `(bs, seq_len)` instead of `(seq_len,)`. In particular, `sampler` and `logits_processors` will need to handle logits of shape `(bs, vocab_size)` instead of `(vocab_size,)`.
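For illustration, a sampler written for the old `(vocab_size,)` shape can be adapted by operating along the last axis; a minimal sketch (a plain temperature sampler, not the PR's actual code):

```python
import mlx.core as mx

def batched_temperature_sampler(logits: mx.array, temp: float = 0.8) -> mx.array:
    # Under this PR, logits arrive with shape (bs, vocab_size) instead of
    # (vocab_size,); sampling along the last axis keeps the code shape-agnostic.
    if temp == 0:
        return mx.argmax(logits, axis=-1)  # greedy, shape (bs,)
    return mx.random.categorical(logits * (1 / temp), axis=-1)  # shape (bs,)

# A logits_processor likewise receives and must return (bs, vocab_size) logits.
```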

@llllvvuu llllvvuu changed the title from "feat: support batch input in generate()" to "feat(mlx_lm): support batch input in generate()" on Aug 21, 2024
@llllvvuu llllvvuu force-pushed the feat/batch_generate branch from 7332759 to 332a713 on August 21, 2024 05:20
@llllvvuu llllvvuu marked this pull request as draft August 21, 2024 05:22
@llllvvuu llllvvuu force-pushed the feat/batch_generate branch from 332a713 to 12c6066 on August 21, 2024 05:25
@llllvvuu llllvvuu marked this pull request as ready for review August 21, 2024 05:25
@llllvvuu llllvvuu force-pushed the feat/batch_generate branch from 12c6066 to ef92993 on August 21, 2024 05:45
@llllvvuu (Contributor, Author) commented Aug 26, 2024

Kind of interesting: for quantized models, the throughput doesn't go up much across small batch sizes (bs = 1, 2, 3, 4), but then it starts to go up a lot at higher bs, which is the opposite of what I expected intuitively. For unquantized models the throughput does go up across the small batch sizes. I observe the same on @willccbb's original repo.

The `prompt` argument can now be either a `str` or `list[str]`.

The change to `generate()` is backwards-compatible.

The changes to `generate_step()`, `top_p_sampling()`, and
`min_p_sampling()` are backwards-incompatible in order to unify shapes;
this could be changed by adding a few if-statements, if preferred.
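For illustration, usage at this stage of the PR might look like the sketch below; the model path is a placeholder, and the list-of-strings return value for the batched case is an assumption based on the commit message above.

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/SomeModel-4bit")  # placeholder model path

# Single prompt: unchanged behaviour, returns one string.
text = generate(model, tokenizer, prompt="What is MLX?", max_tokens=64)

# Batched prompts: pass a list[str] and each prompt is generated in parallel.
texts = generate(
    model,
    tokenizer,
    prompt=["What is MLX?", "Write a haiku about GPUs."],
    max_tokens=64,
)
```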
@llllvvuu llllvvuu force-pushed the feat/batch_generate branch from 5105b31 to 2caa832 on August 29, 2024 12:15
Review thread on llms/mlx_lm/utils.py (outdated, resolved)
@awni (Member) commented Aug 29, 2024

I think it makes sense, to minimize the complexity of the generate function (which is becoming a bit spaghetti), to split the batched generation out into a separate function called batch_generate. I would simplify that function to have fewer arguments: no formatter, no printing during generation, and verbose only prints the timings (e.g. as you have it now).
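One way the proposed interface could look (an illustrative sketch of the suggestion above, not code from the PR; the exact parameters are assumptions):

```python
def batch_generate(
    model,
    tokenizer,
    prompts: list[str],
    max_tokens: int = 256,
    verbose: bool = False,  # prints only the timing summary, no per-token output
    **kwargs,               # sampler, logits_processors, etc. (no formatter)
) -> list[str]:
    """Generate completions for a batch of prompts; hypothetical signature."""
    ...
```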

Also, maybe more tricky: I think that for this to be correct, the causal masks need to account for the left padding in the input (please correct me if I'm wrong about that). This has two implications:

  1. We'd probably need to add a mask parameter to the model __call__ functions and provide an appropriately constructed mask for the batch case.
  2. The Rotating KV cache will be broken in this case (it keeps the initial tokens, which would be the padding tokens), and when it rotates, the mask would need to be updated to account for the padding (which is a bit complicated/tedious). In this case I may suggest disabling this option entirely.

Let me know what you think about the above.
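For reference, a left-padding-aware additive causal mask along the lines described above could be built roughly like this (an illustrative sketch, not the PR's implementation; `lengths` is assumed to hold each sequence's unpadded length):

```python
import mlx.core as mx

def padded_causal_mask(seq_len: int, lengths: mx.array) -> mx.array:
    idx = mx.arange(seq_len)
    # Standard causal mask: query position i may attend to key positions j <= i.
    causal = idx[:, None] >= idx[None, :]                 # (seq_len, seq_len)
    # With left padding, the first (seq_len - length) key positions are pads
    # and must never be attended to.
    num_pads = seq_len - lengths                          # (bs,)
    not_pad = idx[None, :] >= num_pads[:, None]           # (bs, seq_len)
    allowed = mx.logical_and(causal[None, :, :], not_pad[:, None, :])
    # Additive mask: 0 where attention is allowed, -inf where it is blocked.
    mask = mx.where(allowed, 0.0, float("-inf"))          # (bs, seq_len, seq_len)
    return mask[:, None, :, :]                            # broadcast over heads
```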

@llllvvuu (Contributor, Author) commented:

I think it makes sense, to minimize the complexity of the generate function (which is becoming a bit spaghetti), to split the batched generation out into a separate function called batch_generate. I would simplify that function to have fewer arguments: no formatter, no printing during generation, and verbose only prints the timings (e.g. as you have it now).

Makes sense to me, will implement.

Also, maybe more tricky: I think that for this to be correct, the causal masks need to account for the left padding in the input (please correct me if I'm wrong about that). This has two implications:

  1. We'd probably need to add a mask parameter to the model __call__ functions and provide an appropriately constructed mask for the batch case.

Yes, this sounds straightforward enough.

  2. The Rotating KV cache will be broken in this case (it keeps the initial tokens, which would be the padding tokens), and when it rotates, the mask would need to be updated to account for the padding (which is a bit complicated/tedious). In this case I may suggest disabling this option entirely.

I'll do a bit of thinking if there's an easy way to handle this, otherwise I'll remove that parameter in batch_generate.

Will update when these changes are ready!

@awni (Member) commented Sep 27, 2024

@llllvvuu are you coming back to this?

@llllvvuu (Contributor, Author) commented:

Hey @awni, sorry for the delay; I've been job hunting this month. I should be able to get back to this in about a week.

@awni (Member) commented Sep 28, 2024

No worries, just checking. I'll follow up in a week or so.

@llllvvuu llllvvuu force-pushed the feat/batch_generate branch from bea0c4b to 8fb82fe on October 9, 2024 19:13
@nath1295 commented:

Just realised the attention mask has already been mentioned in this PR; it is the reason I raised issue #1044.

TODO: Re-implement `batch_generate`
TODO: Update all `generate_step` callsites

NOTE: `generate_step` taking `(bs, seq_len)` instead of `(seq_len,)` is
a breaking change. In particular, `sampler` and `logits_processors` will
need to handle logits of shape `(bs, vocab_size)` instead of `(vocab_size,)`.
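For context, feeding `(bs, seq_len)` into `generate_step` implies left-padding the tokenized prompts to a common length, roughly as in the sketch below (illustrative only; the pad-token choice and helper name are not from the PR):

```python
import mlx.core as mx

def left_pad_prompts(prompts: list[list[int]], pad_id: int) -> tuple[mx.array, mx.array]:
    # Stack tokenized prompts into a (bs, max_len) array, padding on the left
    # so that the most recent tokens line up at the right edge.
    max_len = max(len(p) for p in prompts)
    padded = [[pad_id] * (max_len - len(p)) + p for p in prompts]
    lengths = mx.array([len(p) for p in prompts])  # unpadded lengths, used for the mask
    return mx.array(padded), lengths
```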
@llllvvuu llllvvuu marked this pull request as draft December 27, 2024 09:17
@llllvvuu llllvvuu changed the title from "feat(mlx_lm): support batch input in generate()" to "feat(mlx_lm)!: batch_generate" on Dec 27, 2024
@llllvvuu llllvvuu marked this pull request as ready for review December 27, 2024 23:53
@llllvvuu (Contributor, Author) commented Dec 27, 2024

Sorry for the delay @awni. I took advantage of #1173 to update this PR. It is pending a versioned release of ml-explore/mlx#1726 for the mask dtype.

I noticed one other potential issue: for absolute/rotary positional encodings, the position IDs of left-padded prompts won't start from 0 (this becomes trickier if a padded prompt cache is added, since the position IDs would then become non-contiguous, IIUC). I'm not sure what the priority of this is, or whether it requires any change to mx.fast.rope.
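To make the position-ID concern concrete, a small sketch (illustrative numbers; not code from the PR):

```python
import mlx.core as mx

seq_len = 8
lengths = mx.array([8, 5, 3])                 # unpadded prompt lengths in the batch
num_pads = seq_len - lengths                  # left padding per sequence: [0, 3, 5]

# If positions are counted from the padded start, real tokens in sequence i
# begin at position num_pads[i] instead of 0. Shifting per sequence:
positions = mx.arange(seq_len)[None, :] - num_pads[:, None]
# positions[i, j] is negative for pad slots and 0, 1, 2, ... for real tokens;
# a single shared offset cannot express this per-sequence shift, which is why
# mx.fast.rope might need a change (the open question above).
```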

@qinxuye commented Jan 13, 2025

Any update? We need this to support parallel generation.

@awni (Member) commented Jan 13, 2025

Will get to this soon. Sorry for the delay.
