Unexpected dramatic inference slow down after a number of LLM queries #1008
Replies: 4 comments 4 replies
-
What's the model size you are using, and how much RAM is on the machine? Could you log the active and peak memory? It would also be super useful for debugging if you could provide a way to reproduce this issue, like the driving code and some example prompt + generation lengths to try.
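A minimal sketch of the kind of per-query memory logging being asked for, assuming the mlx Metal memory APIs (`mx.metal.get_active_memory`, `mx.metal.get_peak_memory`, `mx.metal.get_cache_memory`); the exact calls intended aren't shown in this excerpt.

```python
# Sketch: log MLX memory counters after each query (values are in bytes).
import mlx.core as mx

def log_memory(tag: str) -> None:
    """Print active, peak, and cache memory in GB for one point in the run."""
    gb = 1024 ** 3
    print(
        f"[{tag}] "
        f"active={mx.metal.get_active_memory() / gb:.3f} GB, "
        f"peak={mx.metal.get_peak_memory() / gb:.3f} GB, "
        f"cache={mx.metal.get_cache_memory() / gb:.3f} GB"
    )
```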
-
I'm continuing to look into this. There's definitely a visible change in memory usage as seen from Activity Monitor, so I've started logging the active and peak memory, and I've noted the versions I'm running. I'm running the test case via a VS Code Jupyter notebook and will report back when the issue happens, but unfortunately it takes 30+ minutes to trigger, so it's a slow old process!
-
I've managed to trigger the issue. The memory values are mostly constant during the run, both pre- and post-slowdown, with peak memory at 38.656 GB. I've run the suggested sysctls and will retry my testing.
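The excerpt doesn't show which sysctls were suggested; a sketch for checking the GPU wired-memory limits that commonly come up for large models on Apple Silicon, where `iogpu.wired_limit_mb` and `iogpu.wired_lwm_mb` are assumptions rather than the confirmed keys from this thread:

```python
# Sketch: print the current GPU wired-memory sysctl values before/after tuning.
import subprocess

for key in ("iogpu.wired_limit_mb", "iogpu.wired_lwm_mb"):  # assumed keys
    result = subprocess.run(["sysctl", key], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())

# Raising a limit requires root and is done from a shell, e.g.:
#   sudo sysctl iogpu.wired_limit_mb=<megabytes>
```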
-
I've been testing for a number of hours, and it looks like the problem has gone away since the sysctl changes. Thank you for the help, @awni!
-
Hi folks, I'm using mlx-lm to run a Q4 quantisation of Llama 3.1 on an M1 Ultra, and I'm experiencing a dramatic slowdown in inference speed after something like 30 very large (5000+ token) prompts have been run.
Looking at active and peak memory, everything seems stable, so I'm stuck as to what to look at next. Are there any deeper metrics I can start logging? My next thing to try would be to completely reload the model partway through my batch process, but that's brute-forcing something that ideally shouldn't be happening.
I don't really have enough to open an issue with yet, so does anyone have some advice as to what I can try next?
Thanks!
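A minimal reproduction sketch along the lines of the driving code asked for above: run many long prompts back to back and log wall time plus memory each iteration to see where the slowdown starts. The model repo, prompt construction, and `max_tokens` are assumptions; the original driving code is not shown in the thread.

```python
# Sketch: repeated long-prompt generation with per-run timing and memory logging.
import time
import mlx.core as mx
from mlx_lm import load, generate

# Assumed 4-bit Llama 3.1 repo; substitute the actual model being used.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
long_prompt = "Summarise the following text:\n" + ("lorem ipsum " * 2500)  # roughly 5000+ tokens

gb = 1024 ** 3
for i in range(60):
    start = time.time()
    text = generate(model, tokenizer, prompt=long_prompt, max_tokens=256)
    elapsed = time.time() - start
    print(
        f"run {i:02d}: {elapsed:.1f}s, "
        f"active={mx.metal.get_active_memory() / gb:.2f} GB, "
        f"peak={mx.metal.get_peak_memory() / gb:.2f} GB"
    )
```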