Improving benchmarking token counting accuracy #726

Open
nwangfw opened this issue Feb 21, 2025 · 0 comments
🐛 Describe the bug

The token count computed by the Python client does not match the count reported by vLLM. In the example below, the input lengths differ by about 300 tokens (1026 on the client versus 1316 from vLLM). This is worth investigating.

"Prompt": "You are an AI programming assistant and your task is to generate a SQL query based on the input database schema and user questions.\n### Task Description:\nGiven the following database schema, please write a SQL query to answer the given question.\n\n### Schema:\nThe database contains 4 tables: ['atom', 'bond', 'connected', 'molecule'].\n\n- Table: atom\n\t- Description: The table atom has 3 columns: ['atom_id', 'molecule_id', 'element'].\n\t- Primary Key: atom_id\n\t- Foreign Keys: atom.molecule_id = molecule.molecule_id\n\t- Column: atom_id\n\t\t- Type: TEXT\n\t\t- Description: the unique id of atoms\n\t\t- Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: identifying the molecule to which the atom belongs\n\t\t- Sampled Values: TR000, TR001, TR002, TR003, TR004\n\t- Column: element\n\t\t- Type: TEXT\n\t\t- Description: the element of the toxicology\n\t\t- Sampled Values: b, br, c, ca, cl\n\n- Table: bond\n\t- Description: The table bond has 3 columns: ['bond_id', 'molecule_id', 'bond_type'].\n\t- Primary Key: bond_id\n\t- Foreign Keys: bond.molecule_id = molecule.molecule_id\n\t- Column: bond_id\n\t\t- Type: TEXT\n\t\t- Description: unique id representing bonds\n\t\t- Sampled Values: TR000_1_2, TR000_2_3, TR000_2_4, TR000_2_5, TR001_10_11\n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: identifying the molecule in which the bond appears\n\t\t- Sampled Values: TR000, TR001, TR002, TR003, TR004\n\t- Column: bond_type\n\t\t- Type: TEXT\n\t\t- Description: type of the bond\n\t\t- Sampled Values: None, #, -, =\n\n- Table: connected\n\t- Description: The table connected has 3 columns: ['atom_id', 'atom_id2', 'bond_id'].\n\t- Primary Key: ['atom_id', 'atom_id2']\n\t- Foreign Keys: connected.atom_id = atom.atom_id, connected.atom_id2 = atom.atom_id, connected.bond_id = bond.bond_id\n\t- Column: atom_id\n\t\t- Type: TEXT\n\t\t- Description: id of the first atom\n\t\t- 
Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: atom_id2\n\t\t- Type: TEXT\n\t\t- Description: id of the second atom\n\t\t- Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: bond_id\n\t\t- Type: TEXT\n\t\t- Description: bond id representing bond between two atoms\n\t\t- Sampled Values: TR000_1_2, TR000_2_3, TR000_2_4, TR000_2_5, TR001_10_11\n\n- Table: molecule\n\t- Description: The table molecule has 2 columns: ['molecule_id', 'label'].\n\t- Primary Key: molecule_id\n\t- Foreign Keys: \n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: unique id of molecule\n\t\t- Sampled Values: TR000, TR001, TR002, TR004, TR006\n\t- Column: label\n\t\t- Type: TEXT\n\t\t- Description: whether this molecule is carcinogenic or not\n\t\t- Sampled Values: +, -\n\n### Prior Knowledge:\n- label = '+' mean molecules are carcinogenic;\n\n### Requirements:\n* Please only return the SQL query to answer the question with no explanation.\n* Please format your answer into a Markdown code block as sql\n<YOUR SQL QUERY>\n.\n* Please do NOT select extra columns that are not explicitly requested in the query.\n* Ensure that the table and column names in the generated query exactly match those in the schema. Do NOT include any columns or tables that are not present in the provided schema.\n* Please ensure that the SQL query remains concise and avoids unnecessary joins with unrelated tables.\n\n### Question:\nHow many of the molecules are carcinogenic?\n\n### Output:\n",
"Prompt Length": 1026,
"Output Length": 20,

"Metadata": {
"model_response": {
"id": "chat-b2f6004c0c844c8d94b0e1ef95339882",
"object": "chat.completion",
"created": 1740116628,
"model": "deepseek-coder-33b-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "sql\nSELECT COUNT(*) FROM molecule WHERE label = '+'\n",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 1316,
"total_tokens": 1336,
"completion_tokens": 20
},
"prompt_logprobs": null
},
"temperature": 0.0
}

Steps to Reproduce

Send a curl request to vLLM; the response includes prompt_tokens under usage. Compare this number with the count produced by the local tokenizer.
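The comparison in this step can be sketched as below. This is a minimal illustration, not the benchmark client's actual code: `compare_token_counts` is a hypothetical helper, and the input dict mirrors only the `usage` part of the `model_response` logged above.

```python
from typing import Any, Dict


def compare_token_counts(response: Dict[str, Any], local_prompt_tokens: int) -> Dict[str, int]:
    """Compare the server-reported prompt token count with a locally computed one.

    `response` is assumed to be the parsed JSON body of a completion response
    that carries a `usage.prompt_tokens` field, as in the log above.
    """
    server_prompt_tokens = response["usage"]["prompt_tokens"]
    return {
        "server": server_prompt_tokens,
        "local": local_prompt_tokens,
        "difference": server_prompt_tokens - local_prompt_tokens,
    }


# Values copied from the report: "Prompt Length" (client) and usage.prompt_tokens (vLLM).
reported_response = {
    "usage": {"prompt_tokens": 1316, "total_tokens": 1336, "completion_tokens": 20}
}
result = compare_token_counts(reported_response, local_prompt_tokens=1026)
print(result)
```

For the numbers in this report the difference is 290 tokens, which matches the "about 300" gap described above.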

Expected behavior

These two token counts should be the same.
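One possible contributor to the gap, not confirmed here: the logged response is a chat.completion object, so the server presumably wrapped the prompt in the model's chat template before counting, while the client may have tokenized the raw prompt string. The sketch below counts tokens after applying the template on the client side. `count_chat_prompt_tokens` is a hypothetical helper, and the tokenizer is assumed to be a Hugging Face tokenizer exposing `apply_chat_template` and `encode`.

```python
from typing import Any, Dict, List


def count_chat_prompt_tokens(tokenizer: Any, messages: List[Dict[str, str]]) -> int:
    """Count prompt tokens the way a chat endpoint is assumed to see them.

    The chat template is rendered to text first, then encoded without adding
    special tokens again, since a template may already include them.
    """
    rendered = tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return the rendered prompt as a string
        add_generation_prompt=True,  # include the assistant turn prefix
    )
    return len(tokenizer.encode(rendered, add_special_tokens=False))
```

With a real tokenizer this would be called as `count_chat_prompt_tokens(tok, [{"role": "user", "content": prompt}])`, where `tok` is loaded for the same model the server runs. If the result still differs from the server's prompt_tokens, the cause lies elsewhere, for example a different tokenizer on the client.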

Environment

  • LLM model used: deepseek-coder-33b-instruct
@nwangfw nwangfw self-assigned this Feb 21, 2025
@nwangfw nwangfw changed the title Improving benchmarking accuracy Improving benchmarking token counting accuracy Feb 21, 2025