Export state/prefix-cache & reuse #14895
M0rpheus-0
announced in Q&A
Hello,
I am using vLLM in a Python script and serving my own inference endpoint through Flask. I do this because of some constraints that require custom logic during inference.
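For context, my setup looks roughly like this (a minimal sketch; the model name, route, and sampling parameters are placeholders, and my custom logic is omitted):

```python
from flask import Flask, jsonify, request
from vllm import LLM, SamplingParams

app = Flask(__name__)
# Placeholder model; in my real script this is a local checkpoint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

@app.route("/generate", methods=["POST"])
def generate():
    body = request.get_json()
    params = SamplingParams(temperature=0.2, max_tokens=256)
    # My custom pre/post-processing logic lives around this call.
    outputs = llm.generate([body["prompt"]], params)
    return jsonify({"text": outputs[0].outputs[0].text})

if __name__ == "__main__":
    app.run(port=8000)
```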
Having used llama.cpp in the past, I could make use of a feature like llm.save_state(), where you basically export the model's hidden state (prefix cache) and reload it when you want to save time on re-ingesting the prefix.

My use case is one where I have 3 large prompts followed by a small custom instruction at the end. I would like to keep those 3 prompts cached for efficiency's sake and reload them as necessary to cut down the ingestion/preload time.
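For concreteness, this is roughly the pattern I used with the llama-cpp-python bindings (reconstructed from memory, so treat the details as approximate; the model path and prompts are placeholders):

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=8192)  # placeholder path

long_prefix = "...one of my 3 large prompts..."

llm(long_prefix, max_tokens=1)   # ingest the prefix once
state = llm.save_state()         # snapshot the model/KV-cache state

# Later (or on another request): restore the snapshot instead of re-ingesting.
llm.load_state(state)
out = llm(long_prefix + "\nShort custom instruction", max_tokens=128)
# Only the new instruction tokens need to be evaluated, since the tokens of
# long_prefix are already present in the restored cache.
```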
Does vLLM offer this functionality?
If not, is there some way I could implement it?
Thank you all!
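P.S. The closest thing I have found so far is vLLM's automatic prefix caching, which keeps the KV cache for repeated prefixes in GPU memory across requests within the same process, but as far as I can tell it cannot be exported to disk and reloaded later. A minimal sketch of that, assuming the enable_prefix_caching engine argument (model name and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

long_prefix = "...one of my 3 large prompts..."
params = SamplingParams(temperature=0.2, max_tokens=128)

# The first request ingests the prefix and populates the in-memory prefix cache.
llm.generate([long_prefix + "\nInstruction A"], params)

# Later requests that share the prefix skip most of the prefill work, as long
# as the same LLM instance (and its KV cache) stays alive in the process.
outputs = llm.generate([long_prefix + "\nInstruction B"], params)
print(outputs[0].outputs[0].text)
```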