
Added dynamic context size. This is perfect for servers running llama models as a service. #13295


Open
wants to merge 8 commits into master

Conversation

@J4e6eR commented May 4, 2025

The context size, which determines how much space is allocated for model execution and the KV cache, cannot be modified once the model and context params are initialized. This is a problem for servers running models as a service, since the required context size is bound to increase over time. With a dynamic context size, there is no need to restart the server once the context size is exceeded.

Dynamic context size is achieved by modifying `n_ctx` in `cparams` and then resetting the previous memory to create new memory via `memory.reset(model.create_memory(params_mem, cparams));`. Since creating new memory deletes the earlier context, the best way to preserve it is to save and reload the state.
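Concretely, the change boils down to something like the following sketch. The member names (`cparams`, `model`, `params_mem`, `memory`) follow the description above; the function name `resize_ctx` and its exact signature are illustrative, not the PR's final API:

```cpp
// Illustrative sketch of the resize path inside llama_context.
// resize_ctx is a hypothetical name for the new entry point.
void llama_context::resize_ctx(uint32_t n_ctx_new) {
    if (n_ctx_new == cparams.n_ctx) {
        return; // nothing to do
    }

    cparams.n_ctx = n_ctx_new;

    // Discard the old KV-cache memory and allocate a fresh one sized
    // for the new context. This deletes the cached context, so the
    // state should be saved beforehand and restored afterwards.
    memory.reset(model.create_memory(params_mem, cparams));
}
```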

I will add the load-state feature as a default part of this operation in the next commit.
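Until then, callers can round-trip the state themselves. Here is a minimal sketch using the public state API from llama.h (`llama_state_get_size` / `llama_state_get_data` / `llama_state_set_data`); `resize_fn` stands in for the new resize entry point, whose final API is not yet decided:

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

// Sketch: preserve the context state across a memory reset. resize_fn
// is a placeholder for whatever entry point this PR ends up exposing.
template <typename ResizeFn>
static void resize_preserving_state(llama_context * ctx, ResizeFn resize_fn) {
    // Serialize the full context state (KV cache, logits, embeddings, ...).
    std::vector<uint8_t> buf(llama_state_get_size(ctx));
    llama_state_get_data(ctx, buf.data(), buf.size());

    // Grow the context; the old memory is discarded here.
    resize_fn(ctx);

    // Restore the saved state into the freshly allocated memory.
    llama_state_set_data(ctx, buf.data(), buf.size());
}
```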

@J4e6eR (Author) commented May 5, 2025

The next goal is to get dynamic context size working without needing to reset memory. Is it possible? Let's see!

@J4e6eR (Author) commented May 7, 2025

Hey @ggerganov,
Please have a look at this. It can be helpful for servers that need a dynamic context size, preventing them from terminating with errors once a program exceeds the context size.
I am currently working on the follow-up task I posted earlier.
Are there any changes you would like me to make to improve this commit? I am open to suggestions and improvements.
Thank you.

@ggerganov (Member) commented

Hi, I am not convinced that this is a useful feature. IMO the application should pre-allocate the worst-case amount of memory that it plans to use. This way, if it is able to start, you have a guarantee that it will keep running without running out of memory at some later point.

I don't see use cases where dynamically adjusting the context has an advantage compared to the existing logic.
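For reference, the existing approach is a one-time allocation at context creation, along these lines (sketch only; the 32768 figure is an arbitrary example, and `llama_init_from_model` is the current name in llama.h, with older versions using `llama_new_context_with_model`):

```cpp
#include "llama.h"

// Pre-allocate the worst case up front: if context creation succeeds,
// the server cannot run out of KV-cache memory at some later point.
llama_context * make_server_ctx(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 32768; // worst-case context the server plans to serve

    return llama_init_from_model(model, cparams); // NULL => fail fast at startup
}
```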

@J4e6eR (Author) commented May 7, 2025

@ggerganov
So if the application pre-allocates more memory beforehand, what is the significance of the context size (`n_ctx`)?
I ask because earlier, when I was testing one of the example programs, probably simple-chat, I did exceed the context size after a few back-and-forth exchanges with the model, and it terminated the program with the error message "context size exceeded".
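The check in question looks roughly like this (paraphrased from examples/simple-chat, where `ctx` and `batch` are the example's context and current batch; the KV-cache helper has been renamed across versions, so treat the exact name as approximate):

```cpp
// Inside the chat loop, before decoding the next batch:
const int n_ctx      = llama_n_ctx(ctx);
const int n_ctx_used = llama_get_kv_cache_used_cells(ctx); // newer: llama_kv_self_used_cells

if (n_ctx_used + batch.n_tokens > n_ctx) {
    fprintf(stderr, "context size exceeded\n");
    exit(0);
}
```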
