The token count computed on the Python client side does not match the token count reported by vLLM. For the record below, the client counts 1026 prompt tokens while vLLM reports 1316, a difference of about 300 tokens in input length. This may be worth investigating.
"Prompt": "You are an AI programming assistant and your task is to generate a SQL query based on the input database schema and user questions.\n### Task Description:\nGiven the following database schema, please write a SQL query to answer the given question.\n\n### Schema:\nThe database contains 4 tables: ['atom', 'bond', 'connected', 'molecule'].\n\n- Table: atom\n\t- Description: The table atom has 3 columns: ['atom_id', 'molecule_id', 'element'].\n\t- Primary Key: atom_id\n\t- Foreign Keys: atom.molecule_id = molecule.molecule_id\n\t- Column: atom_id\n\t\t- Type: TEXT\n\t\t- Description: the unique id of atoms\n\t\t- Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: identifying the molecule to which the atom belongs\n\t\t- Sampled Values: TR000, TR001, TR002, TR003, TR004\n\t- Column: element\n\t\t- Type: TEXT\n\t\t- Description: the element of the toxicology\n\t\t- Sampled Values: b, br, c, ca, cl\n\n- Table: bond\n\t- Description: The table bond has 3 columns: ['bond_id', 'molecule_id', 'bond_type'].\n\t- Primary Key: bond_id\n\t- Foreign Keys: bond.molecule_id = molecule.molecule_id\n\t- Column: bond_id\n\t\t- Type: TEXT\n\t\t- Description: unique id representing bonds\n\t\t- Sampled Values: TR000_1_2, TR000_2_3, TR000_2_4, TR000_2_5, TR001_10_11\n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: identifying the molecule in which the bond appears\n\t\t- Sampled Values: TR000, TR001, TR002, TR003, TR004\n\t- Column: bond_type\n\t\t- Type: TEXT\n\t\t- Description: type of the bond\n\t\t- Sampled Values: None, #, -, =\n\n- Table: connected\n\t- Description: The table connected has 3 columns: ['atom_id', 'atom_id2', 'bond_id'].\n\t- Primary Key: ['atom_id', 'atom_id2']\n\t- Foreign Keys: connected.atom_id = atom.atom_id, connected.atom_id2 = atom.atom_id, connected.bond_id = bond.bond_id\n\t- Column: atom_id\n\t\t- Type: TEXT\n\t\t- Description: id of the first atom\n\t\t- 
Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: atom_id2\n\t\t- Type: TEXT\n\t\t- Description: id of the second atom\n\t\t- Sampled Values: TR000_1, TR000_2, TR000_3, TR000_4, TR000_5\n\t- Column: bond_id\n\t\t- Type: TEXT\n\t\t- Description: bond id representing bond between two atoms\n\t\t- Sampled Values: TR000_1_2, TR000_2_3, TR000_2_4, TR000_2_5, TR001_10_11\n\n- Table: molecule\n\t- Description: The table molecule has 2 columns: ['molecule_id', 'label'].\n\t- Primary Key: molecule_id\n\t- Foreign Keys: \n\t- Column: molecule_id\n\t\t- Type: TEXT\n\t\t- Description: unique id of molecule\n\t\t- Sampled Values: TR000, TR001, TR002, TR004, TR006\n\t- Column: label\n\t\t- Type: TEXT\n\t\t- Description: whether this molecule is carcinogenic or not\n\t\t- Sampled Values: +, -\n\n### Prior Knowledge:\n- label = '+' mean molecules are carcinogenic;\n\n### Requirements:\n* Please only return the SQL query to answer the question with no explanation.\n* Please format your answer into a Markdown code block as sql\n<YOUR SQL QUERY>\n.\n* Please do NOT select extra columns that are not explicitly requested in the query.\n* Ensure that the table and column names in the generated query exactly match those in the schema. Do NOT include any columns or tables that are not present in the provided schema.\n* Please ensure that the SQL query remains concise and avoids unnecessary joins with unrelated tables.\n\n### Question:\nHow many of the molecules are carcinogenic?\n\n### Output:\n", "Prompt Length": 1026,
"Output Length": 20,
"Metadata": {
"model_response": {
"id": "chat-b2f6004c0c844c8d94b0e1ef95339882",
"object": "chat.completion",
"created": 1740116628,
"model": "deepseek-coder-33b-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "sql\nSELECT COUNT(*) FROM molecule WHERE label = '+'\n",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 1316,
"total_tokens": 1336,
"completion_tokens": 20
},
"prompt_logprobs": null
},
"temperature": 0.0
}
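To make the discrepancy concrete, the following minimal sketch reads both counts out of a record shaped like the one above and reports the gap. The function name `token_count_gap` is hypothetical, and the record is trimmed (the prompt text is replaced by a placeholder); only the field names and numbers come from the report.

```python
from typing import Any, Dict


def token_count_gap(record: Dict[str, Any]) -> Dict[str, Any]:
    """Compare the client-side prompt length with vLLM's reported prompt_tokens."""
    local = record["Prompt Length"]
    usage = record["Metadata"]["model_response"]["usage"]
    server = usage["prompt_tokens"]
    return {
        "local": local,
        "server": server,
        "gap": server - local,
        # Sanity check: the server's own totals should add up.
        "total_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"]
            == usage["total_tokens"]
        ),
    }


# Trimmed copy of the record from this report; the prompt text is a placeholder.
record = {
    "Prompt": "<prompt text omitted>",
    "Prompt Length": 1026,
    "Output Length": 20,
    "Metadata": {
        "model_response": {
            "usage": {
                "prompt_tokens": 1316,
                "total_tokens": 1336,
                "completion_tokens": 20,
            }
        }
    },
}

result = token_count_gap(record)
print(result)
```

The completion count agrees on both sides (20), so the mismatch is confined to the prompt.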
Steps to Reproduce
Send a request to vLLM with curl; the response reports prompt_tokens in its usage field. Compare this number with the count produced by the local tokenizer for the same prompt.
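One thing worth checking when reproducing this, offered as an assumption rather than a diagnosis: the response object is chat.completion, and a chat endpoint renders the messages through the model's chat template before tokenizing, whereas a local count may be taken on the raw prompt string. The sketch below illustrates how that alone produces a gap. The whitespace tokenizer, the template, and the default system prompt are all hypothetical stand-ins, not the real deepseek-coder tokenizer or template; for a real comparison, swap in the model's own tokenizer and chat template.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[str]]


def toy_encode(text: str) -> List[str]:
    # Stand-in tokenizer: splits on whitespace. NOT the model's real tokenizer.
    return text.split()


def toy_chat_template(messages: List[Dict[str, str]], default_system: str) -> str:
    # Hypothetical template: prepend a default system prompt when none is
    # given, wrap each turn in a role marker, and end with an assistant marker.
    if not any(m["role"] == "system" for m in messages):
        messages = [{"role": "system", "content": default_system}] + messages
    parts = [f"<{m['role']}> {m['content']}" for m in messages]
    parts.append("<assistant>")
    return "\n".join(parts)


def compare_counts(encode: Encoder, prompt: str, default_system: str) -> Dict[str, int]:
    # Count tokens for the raw prompt and for the templated conversation.
    raw = len(encode(prompt))
    templated_text = toy_chat_template(
        [{"role": "user", "content": prompt}], default_system
    )
    templated = len(encode(templated_text))
    return {"raw": raw, "templated": templated, "overhead": templated - raw}


counts = compare_counts(
    toy_encode,
    "How many of the molecules are carcinogenic?",
    "You are a helpful assistant.",
)
print(counts)
```

If the overhead measured with the real tokenizer and template is close to the observed gap, the two counts are measuring different inputs rather than disagreeing about the same one.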
Expected behavior
The two token counts should be the same.
Environment
LLM Model used: deepseek-33b