Unexpected dramatic inference slow down after a number of LLM queries #1008
Replies: 4 comments 4 replies
-
What's the model size you are using, and how much RAM is on the machine? Could you log the active and peak memory? It would also be super useful for debugging if you could provide a way to reproduce this issue, like the driving code and some example prompt + generation lengths to try.
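A minimal sketch of the kind of per-query memory logging being asked for, assuming the mlx Metal memory APIs (`mx.metal.get_active_memory`, `mx.metal.get_peak_memory`, `mx.metal.get_cache_memory`); the exact calls intended aren't shown in this excerpt.

```python
# Sketch: log MLX memory counters after each query (values are in bytes).
import mlx.core as mx

def log_memory(tag: str) -> None:
    """Print active, peak, and cache memory in GB for one point in the run."""
    gb = 1024 ** 3
    print(
        f"[{tag}] "
        f"active={mx.metal.get_active_memory() / gb:.3f} GB, "
        f"peak={mx.metal.get_peak_memory() / gb:.3f} GB, "
        f"cache={mx.metal.get_cache_memory() / gb:.3f} GB"
    )
```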
-
I'm continuing to look into this. There's definitely a visible change in memory usage as seen from Activity Monitor, so I've started logging the active and peak memory, and I've noted the versions I'm running. I'm running the test case via a VS Code Jupyter notebook and will report back when the issue happens, but unfortunately it takes 30+ minutes to trigger, so it's a slow old process!
-
I've managed to trigger the issue. The memory values are mostly constant during the run, both pre- and post-slowdown, with peak memory at 38.656 GB. I've run the suggested sysctls and will retry my testing.
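The excerpt doesn't show which sysctls were suggested; a sketch for checking the GPU wired-memory limits that commonly come up for large models on Apple Silicon, where `iogpu.wired_limit_mb` and `iogpu.wired_lwm_mb` are assumptions rather than the confirmed keys from this thread:

```python
# Sketch: print the current GPU wired-memory sysctl values before/after tuning.
import subprocess

for key in ("iogpu.wired_limit_mb", "iogpu.wired_lwm_mb"):  # assumed keys
    result = subprocess.run(["sysctl", key], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())

# Raising a limit requires root and is done from a shell, e.g.:
#   sudo sysctl iogpu.wired_limit_mb=<megabytes>
```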
-
I've been testing for a number of hours, and it looks like the problem has gone away since the sysctl changes. Thank you for the help, @awni!
-
Hi folks, I'm using mlx-lm to run a Q4 quantisation of Llama 3.1 on an M1 Ultra, and I'm experiencing a dramatic slowdown in inference speed after something like 30 very large (5000+ token) prompts have been run.
Looking at active and peak memory, everything seems stable, so I'm stuck as to what to look at next. Are there any deeper metrics I can start logging? My next thing to try would be to completely reload the model partway through my batch process, but that's brute-forcing something that ideally shouldn't be happening.
I don't really have enough to open an issue with yet, so does anyone have some advice as to what I can try next?
Thanks!
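A minimal reproduction sketch along the lines of the driving code asked for above: run many long prompts back to back and log wall time plus memory each iteration to see where the slowdown starts. The model repo, prompt construction, and `max_tokens` are assumptions; the original driving code is not shown in the thread.

```python
# Sketch: repeated long-prompt generation with per-run timing and memory logging.
import time
import mlx.core as mx
from mlx_lm import load, generate

# Assumed 4-bit Llama 3.1 repo; substitute the actual model being used.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
long_prompt = "Summarise the following text:\n" + ("lorem ipsum " * 2500)  # roughly 5000+ tokens

gb = 1024 ** 3
for i in range(60):
    start = time.time()
    text = generate(model, tokenizer, prompt=long_prompt, max_tokens=256)
    elapsed = time.time() - start
    print(
        f"run {i:02d}: {elapsed:.1f}s, "
        f"active={mx.metal.get_active_memory() / gb:.2f} GB, "
        f"peak={mx.metal.get_peak_memory() / gb:.2f} GB"
    )
```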