
server : refactor slot input data, move tokenizer to HTTP thread #10023

Merged
13 commits merged on Oct 24, 2024

Conversation

@ngxson (Collaborator) commented Oct 23, 2024

Motivation

Ref discussion: #9702 (comment)

The main motivation of this PR is to stop using a JSON prompt as the slot input data. The JSON format is error-prone and messy to work with, because we currently have to support many input shapes (a rough sketch of the parsing this requires follows the quoted comment below):

this supports these cases:
- "prompt": "string"
- "prompt": [12, 34, 56]
- "prompt": [12, 34, "string", 56, 78]
and multiple prompts (multi-tasks):
- "prompt": ["string1", "string2"]
- "prompt": ["string1", [12, 34, 56]]
- "prompt": [[12, 34, "string", 56, 78], [12, 34, 56]]

In addition, some post-processing is currently done at the HTTP level (e.g. applying the chat template), while other steps are done in the inference thread (e.g. formatting the prompt for rerank and infill).

In this PR

I tried moving things around and defining a pattern:

What the HTTP thread does:

  • Validate input types (only partially, since some fields are still validated by launch_slot_with_task)
  • Apply format (chat template, rerank template, infill format)
  • Tokenize the prompt --> tokens are put into task.prompt_tokens

The slot will always take an array of tokens as input, saved into slot.prompt_tokens
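
To make the split concrete, here is a rough sketch of the intended data flow. The type and field names are illustrative, not the server's exact definitions, and it reuses json, llama_token and tokenize_input_prompts from the sketch above.

// Illustrative only: by the time the inference thread runs, everything
// JSON-related has already been resolved on the HTTP thread.
struct server_task_sketch {
    int id = -1;
    std::vector<llama_token> prompt_tokens; // filled by the HTTP thread
};

struct server_slot_sketch {
    std::vector<llama_token> prompt_tokens; // always an array of tokens
};

// HTTP thread: validate, apply formatting (chat / rerank / infill), tokenize
static server_task_sketch handle_completion_request(const json & body) {
    server_task_sketch task;
    const json prompt = body.at("prompt");
    // ... chat template / rerank / infill formatting would happen here ...
    task.prompt_tokens = tokenize_input_prompts(prompt).front(); // one task shown for brevity
    return task;
}

// inference thread: the slot just copies the tokens, no JSON parsing left
static void launch_slot_with_task_sketch(server_slot_sketch & slot, const server_task_sketch & task) {
    slot.prompt_tokens = task.prompt_tokens;
}

As noted above, some per-field validation still happens in launch_slot_with_task, so the split is not yet complete.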

TODO

  • Add test case for infill --> use stories260K with added FIM tokens
  • Remove redundant ctx_server.tokenize function
  • Do not throw error on empty prompt
  • Update readme
  • (maybe) rename SERVER_TASK_TYPE_COMPLETION to _INFERENCE to better reflect what it does

@ngxson (Collaborator, Author) commented Oct 23, 2024

@ggerganov Could you please share some curl commands that you used for testing the /infill API? I would like to add them as test cases. Thank you.

@ggerganov (Owner) commented:

Here is a simple test that verifies that "input_extra" is used during /infill:

curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
{
  "content": "llama_n_threads(NULL);",
  ...
}

I'm not sure what the smallest FIM model is that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful for testing the /infill functionality.

In any case, I'm planning to add similar tests to ci/run.sh where we can afford to run a bigger model in the range of 1-2B.

github-actions bot added the python (python script changes) label on Oct 24, 2024
@ngxson (Collaborator, Author) commented Oct 24, 2024

I ended up adding FIM tokens to the existing stories260K model to make it compatible with the /infill endpoint, so that it can be used in tests. In the future, we could also fine-tune SmolLM 360M to do code infilling (the base model is already trained on coding problems).

I ran the same test on both master and this PR, and obtained the same result.

One thing I noticed while testing: it seems that extra_tokens is not correctly added to prompt_tokens in some cases. For example:

./llama-server -m ../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf -c 4096 -np 2 --verbose

Then send the request from your last message (in my case, with n_predict = 2):

{
    "input_extra": [
        {
            "filename": "llama.h",
            "text": "LLAMA_API int32_t llama_n_threads();\n"
        }
    ],
    "input_suffix": "}\n",
    "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_",
    "prompt": "",
    "temperature": 0,
    "seed": 42,
    "n_predict": 2
}

Then observe the formatted prompt (note that slot.prompt is removed in this PR; the prompt in the response below is detokenize(slot.prompt_tokens), which should be more useful for debugging):

{
    "content": "get_num",
    "id_slot": 0,
    "stop": true,
    "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
    "tokens_predicted": 2,
    "tokens_evaluated": 27,
    "generation_settings": {
        "n_ctx": 2048,
        "n_predict": -1,
        "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
        "seed": 42,
        "seed_cur": 42,
        "temperature": 0.0,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "min_p": 0.05000000074505806,
        "xtc_probability": 0.0,
        "xtc_threshold": 0.10000000149011612,
        "tfs_z": 1.0,
        "typical_p": 1.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.10000000149011612,
        "penalize_nl": false,
        "stop": [],
        "max_tokens": 2,
        "n_keep": 0,
        "n_discard": 0,
        "ignore_eos": false,
        "stream": false,
        "n_probs": 0,
        "min_keep": 0,
        "grammar": "",
        "samplers": [
            "top_k",
            "tfs_z",
            "typ_p",
            "top_p",
            "min_p",
            "xtc",
            "temperature"
        ]
    },
    "prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
    "has_new_line": false,
    "truncated": false,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": true,
    "stopping_word": "",
    "tokens_cached": 28,
    "timings": {
        ...
    },
    "index": 0
}

I suspect it may have something to do with n_extra_take, but that's out of scope for the current PR. Maybe you can fix it in another PR, @ggerganov

@ngxson ngxson marked this pull request as ready for review October 24, 2024 14:43
@ngxson ngxson requested a review from ggerganov October 24, 2024 14:43
@ggerganov (Owner) commented:

I suspect it may have something to do with n_extra_take, but that's out of scope for the current PR

Yes, this logic seems to have issues - thank you for noticing this. I will fix it in a follow-up PR.

@ggerganov (Owner) left a review:

Looks really well done 👍

@wwoodsTM (Contributor) commented:

@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I'm not sure what you were already planning around this, but if my comments in the other PR about how important versatility is in my own situation helped inspire any of your ideas here, then I'm honored that you incorporated them so quickly. Either way, the effort is very much appreciated. Thank you!

@ngxson merged commit 958367b into ggerganov:master on Oct 24, 2024
54 checks passed