
server : refactor slot input data, move tokenizer to HTTP thread #10023

Merged
13 commits merged on Oct 24, 2024

Conversation

@ngxson (Collaborator) commented Oct 23, 2024

Motivation

Ref discussion: #9702 (comment)

The main motivation of this PR is to stop using a JSON prompt as the slot input data. The JSON format is error-prone and messy to work with, because we currently have to support many input shapes (a rough sketch of the parsing this requires follows the quoted comment below):

this supports these cases:
- "prompt": "string"
- "prompt": [12, 34, 56]
- "prompt": [12, 34, "string", 56, 78]
and multiple prompts (multi-tasks):
- "prompt": ["string1", "string2"]
- "prompt": ["string1", [12, 34, 56]]
- "prompt": [[12, 34, "string", 56, 78], [12, 34, 56]]

In addition, some post-processing is currently done at the HTTP level (e.g. applying the chat template), while other steps are done in the inference thread (e.g. formatting the prompt for rerank and infill).

In this PR

I tried moving things around and defining a pattern:

What the HTTP thread does:

  • Validate input types (only partially, since some fields are still validated by launch_slot_with_task)
  • Apply format (chat template, rerank template, infill format)
  • Tokenize the prompt --> tokens are put into task.prompt_tokens

The slot will always take an array of tokens as input, saved into slot.prompt_tokens
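
To make the split concrete, here is a rough sketch of the intended data flow. The type and field names are illustrative, not the server's exact definitions, and it reuses json, llama_token and tokenize_input_prompts from the sketch above.

// Illustrative only: by the time the inference thread runs, everything
// JSON-related has already been resolved on the HTTP thread.
struct server_task_sketch {
    int id = -1;
    std::vector<llama_token> prompt_tokens; // filled by the HTTP thread
};

struct server_slot_sketch {
    std::vector<llama_token> prompt_tokens; // always an array of tokens
};

// HTTP thread: validate, apply formatting (chat / rerank / infill), tokenize
static server_task_sketch handle_completion_request(const json & body) {
    server_task_sketch task;
    const json prompt = body.at("prompt");
    // ... chat template / rerank / infill formatting would happen here ...
    task.prompt_tokens = tokenize_input_prompts(prompt).front(); // one task shown for brevity
    return task;
}

// inference thread: the slot just copies the tokens, no JSON parsing left
static void launch_slot_with_task_sketch(server_slot_sketch & slot, const server_task_sketch & task) {
    slot.prompt_tokens = task.prompt_tokens;
}

As noted above, some per-field validation still happens in launch_slot_with_task, so the split is not yet complete.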

TODO

  • Add test case for infill --> use stories260K with added FIM tokens
  • Remove redundant ctx_server.tokenize function
  • Do not throw error on empty prompt
  • Update readme
  • (maybe) rename SERVER_TASK_TYPE_COMPLETION to _INFERENCE to better reflect what it does

@ngxson (Collaborator, Author) commented Oct 23, 2024

@ggerganov Could you please share some curl commands that you used for testing the /infill API? I would like to add them as test cases. Thank you.

@ggerganov (Owner) commented:

Here is a simple test that verifies that "input_extra" is used during /infill:

curl \
    --silent --no-buffer --request POST \
    --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_extra": [{"filename": "llama.h", "text": "LLAMA_API int32_t llama_n_threads(struct llama_context * ctx);\n"}], "input_suffix": "}\n", "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = ", "prompt": "", "top_k": 1, "stop": ["\n"]}' | jq
{
  "content": "llama_n_threads(NULL);",
  ...
}

I'm not sure what the smallest FIM model is that this would work with. I've tested with Qwen2.5 1.5B, but it might be too big for the server tests script. If you can figure out a way to do it, it would be very useful for testing the /infill functionality.

In any case, I'm planning to add similar tests to ci/run.sh where we can afford to run a bigger model in the range of 1-2B.

github-actions bot added the python (python script changes) label on Oct 24, 2024
@ngxson (Collaborator, Author) commented Oct 24, 2024

I ended up adding FIM tokens to the existing stories260K model to make it compatible with the /infill endpoint, so that it can be used in tests. In the future, we could also fine-tune SmolLM 360M to do code infilling (the base model is already trained on coding problems).

I ran the same test on both master and this PR, and obtained the same result.

One thing I noticed while testing: it seems that extra_tokens is not correctly added to prompt_tokens in some cases. For example:

./llama-server -m ../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf -c 4096 -np 2 --verbose

Then send the request from your last message (in my case, with n_predict = 2):

{
    "input_extra": [
        {
            "filename": "llama.h",
            "text": "LLAMA_API int32_t llama_n_threads();\n"
        }
    ],
    "input_suffix": "}\n",
    "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_",
    "prompt": "",
    "temperature": 0,
    "seed": 42,
    "n_predict": 2
}

Then observe the formatted prompt (note that slot.prompt is removed in this PR; the prompt in the response below is detokenize(slot.prompt_tokens), which should be more useful for debugging):

{
    "content": "get_num",
    "id_slot": 0,
    "stop": true,
    "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
    "tokens_predicted": 2,
    "tokens_evaluated": 27,
    "generation_settings": {
        "n_ctx": 2048,
        "n_predict": -1,
        "model": "../models/qwen2.5-coder-1.5b-instruct-q4_k_m.gguf",
        "seed": 42,
        "seed_cur": 42,
        "temperature": 0.0,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "min_p": 0.05000000074505806,
        "xtc_probability": 0.0,
        "xtc_threshold": 0.10000000149011612,
        "tfs_z": 1.0,
        "typical_p": 1.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.10000000149011612,
        "penalize_nl": false,
        "stop": [],
        "max_tokens": 2,
        "n_keep": 0,
        "n_discard": 0,
        "ignore_eos": false,
        "stream": false,
        "n_probs": 0,
        "min_keep": 0,
        "grammar": "",
        "samplers": [
            "top_k",
            "tfs_z",
            "typ_p",
            "top_p",
            "min_p",
            "xtc",
            "temperature"
        ]
    },
    "prompt": "filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n    int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>",
    "has_new_line": false,
    "truncated": false,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": true,
    "stopping_word": "",
    "tokens_cached": 28,
    "timings": {
        ...
    },
    "index": 0
}

I suspect it may have something to do with n_extra_take, but that's out of scope for the current PR. Maybe you can fix it in another PR, @ggerganov

@ngxson ngxson marked this pull request as ready for review October 24, 2024 14:43
@ngxson ngxson requested a review from ggerganov October 24, 2024 14:43
@ggerganov (Owner) commented:

I suspect it may have something to do with n_extra_take, but that's out of scope for the current PR

Yes, this logic seems to have issues - thank you for noticing this. I will fix it in a follow-up PR.

@ggerganov (Owner) left a review:

Looks really well done 👍

@wwoodsTM (Contributor) commented:

@ngxson Just wanted to add that I really appreciate you integrating such a robust way to deal with the different kinds of prompts that are possible. I'm not sure what you were already planning around this, but if my comments in the other PR about how important versatility is in my own situation helped inspire any of your ideas here, then I'm honored that you incorporated them so quickly. Either way, the effort is very much appreciated. Thank you!

@ngxson merged commit 958367b into ggerganov:master on Oct 24, 2024
54 checks passed