server : refactor slot input data, move tokenizer to HTTP thread #10023
Conversation
@ggerganov Could you please share some curl commands that you used for testing?
Here is a simple test that verifies that `input_extra` is used during …:

```sh
curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
```

```json
{
  "content": "llama_n_threads(NULL);",
  ...
}
```

Not sure what would be the smallest FIM model that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful to test the …. In any case, I'm planning to add similar tests to ….
I ended up adding FIM tokens to the existing stories260K to make it compatible with …. I ran the same test on both …. One thing that I noticed while testing: it seems like ….
And send the request mentioned in your last message (in my case, with …):

```json
{
  "input_extra": [
    {
      "filename": "llama.h",
      "text": "LLAMA_API int32_t llama_n_threads();\n"
    }
  ],
  "input_suffix": "}\n",
  "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_",
  "prompt": "",
  "temperature": 0,
  "seed": 42,
  "n_predict": 2
}
```

Then observe the formatted prompt (please note that …):

```json
{
  "content": "get_num",
  "id_slot": 0,
  "stop": true,
  "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
  "tokens_predicted": 2,
  "tokens_evaluated": 27,
  "generation_settings": {
    "n_ctx": 2048,
    "n_predict": -1,
    "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
    "seed": 42,
    "seed_cur": 42,
    "temperature": 0.0,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "tfs_z": 1.0,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": 2,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ]
  },
  "prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 28,
  "timings": {
    ...
  },
  "index": 0
}
```

I suspect that there may be something to do with ….
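For reference, the formatted prompt above begins with a literal `filename\n` chunk. As a hedged sketch (the `<|repo_name|>`/`<|file_sep|>` special tokens come from Qwen2.5-Coder's published repo-level FIM format and are an assumption here, not something this PR is confirmed to emit), one might instead expect the extra files to be wrapped roughly like this:

```python
# Hedged sketch: one plausible rendering of input_extra for a
# Qwen2.5-Coder style repo-level FIM prompt. The special token names
# are taken from Qwen's FIM format, not from llama.cpp's actual output.
extra = [{"filename": "llama.h",
          "text": "LLAMA_API int32_t llama_n_threads();\n"}]
prefix = ('#include <cstdio>\n#include "llama.h"\n\n'
          'int main() {\n    int n_threads = llama_')
suffix = "}\n"

parts = []
for f in extra:
    # Each extra file gets a file-separator token plus its name and text.
    parts.append(f"<|file_sep|>{f['filename']}\n{f['text']}")
parts.append(f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>")
expected_prompt = "".join(parts)

print(expected_prompt.startswith("<|file_sep|>llama.h"))  # → True
```

Comparing a sketch like this against the `"prompt"` field in the response makes the missing FIM wrapping around the extra chunk easy to spot.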
Yes, this logic seems to have issues - thank you for noticing this. I will fix this in a follow-up PR.
Looks really well done 👍
@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I am not sure what you may have already been planning around this, but if somehow my comments in the other PR about how important versatility was to me in my own situation helped inspire any of your ideas here, then I am honored that you would so rapidly incorporate that. Either way, the effort is very appreciated. Thank you!
Motivation
Ref discussion: #9702 (comment)
The main motivation of this PR is to get rid of having a `json prompt` as slot input data. The `json` data format is quite dangerous and messy to work with, as we now have to support many input shapes.

In addition, we're currently doing some post-processing (i.e. formatting the chat template) at the HTTP level, while other post-processing is done in the inference thread (i.e. formatting the prompt for rerank & infill).
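To illustrate why a free-form JSON prompt is messy, here is a minimal sketch (a hypothetical helper with a toy tokenizer, not the actual server code, and the exact set of shapes is an assumption for illustration) of the kind of shape dispatching every consumer of the prompt has to do:

```python
# Hypothetical sketch: normalizing a JSON "prompt" field into tokens.
# The supported shapes here are illustrative, not llama.cpp's exact set.

def toy_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer: one "token" id per whitespace-separated word.
    return [hash(w) % 1000 for w in text.split()]

def normalize_prompt(prompt) -> list[int]:
    if isinstance(prompt, str):          # "hello world"
        return toy_tokenize(prompt)
    if isinstance(prompt, list):         # [12, "mixed text", 34]
        tokens: list[int] = []
        for part in prompt:
            if isinstance(part, int):
                tokens.append(part)
            elif isinstance(part, str):
                tokens.extend(toy_tokenize(part))
            else:
                raise TypeError(f"unsupported prompt element: {part!r}")
        return tokens
    raise TypeError(f"unsupported prompt shape: {prompt!r}")
```

Every endpoint that accepts a prompt has to reimplement (or carefully reuse) dispatching like this, which is the class of bugs that passing plain token arrays to the slots removes.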
In this PR

I tried moving things around and defining a pattern:

- The HTTP thread tokenizes the input data and dispatches the task to a slot (`launch_slot_with_task`), passing the tokenized input as `task.prompt_tokens`.
- The `slot` will always take an array of tokens as input, saved into `slot.prompt_tokens`.
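The division of labor can be sketched as follows (a minimal illustration; names like `Task`, `Slot`, and `http_handler` are hypothetical stand-ins, not the actual C++ types in `server.cpp`):

```python
# Illustrative sketch of the pattern: the HTTP thread does all
# tokenization, and slots only ever see arrays of token ids.
from dataclasses import dataclass, field

def toy_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer for the sketch.
    return [hash(w) % 1000 for w in text.split()]

@dataclass
class Task:
    prompt_tokens: list[int]  # always tokens, never raw JSON

@dataclass
class Slot:
    prompt_tokens: list[int] = field(default_factory=list)

    def launch_with_task(self, task: Task) -> None:
        # The slot needs no knowledge of JSON input shapes anymore.
        self.prompt_tokens = task.prompt_tokens

def http_handler(body: dict) -> Task:
    # HTTP thread: validate and tokenize, then hand tokens to a slot.
    return Task(prompt_tokens=toy_tokenize(body.get("prompt", "")))

slot = Slot()
slot.launch_with_task(http_handler({"prompt": "int main() {"}))
```

The key property is that the `Task`/`Slot` boundary only ever carries `list[int]`, so per-endpoint formatting (chat template, infill, rerank) stays on the HTTP side.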
TODO

- … the `ctx_server.tokenize` function
- Rename `SERVER_TASK_TYPE_COMPLETION` to `_INFERENCE` to better reflect what it does