
Is this package capable of calculating tokens for OpenAI assistant mode and more advanced chats? #58

jasonsu123 opened this issue Jun 19, 2024 · 4 comments

@jasonsu123
Hello,
I noticed that the package you wrote is very impressive. However, is it only capable of counting tokens for regular, simple chats?

I see that your code requires the input prompt to be a list of messages with "role" and "content" keys (e.g. "user"):

message_prompt = [{ "role": "user", "content": "Hello world"}]

If the Assistants API is used with instructions, file search, and files uploaded to vector stores for RAG, the token calculation is presumably more complex.

Are the token calculation methods for gpt-4-1106-preview and gpt-4o the same?
I checked the tokenizer page on the official website, but a tokenizer for gpt-4o is not yet available there:
https://platform.openai.com/tokenizer

Currently, my code for calculating tokens is as follows. Is this correct?
Thank you.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4-1106-preview")
token_contents = len(encoding.encode(contents))
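For chat-format prompts, the raw encode() count above misses the per-message framing tokens. A minimal sketch of the counting scheme OpenAI's cookbook describes for gpt-3.5/gpt-4-style models (the per-message and reply-priming constants are assumptions and can vary by model):

import tiktoken

def num_tokens_from_messages(messages, model="gpt-4-1106-preview"):
    # Approximate token count for a chat message list; the constants
    # follow the cookbook's scheme for gpt-3.5/gpt-4-style models.
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = 0
    for message in messages:
        num_tokens += 3  # per-message framing overhead
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += 1  # names carry one extra token
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens

print(num_tokens_from_messages([{"role": "user", "content": "Hello world"}]))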

@areibman
Contributor

  1. Assistants mode: we're currently not looking at this capability. It would be a great feature to add, but I must admit I'm not an expert with the Assistants API.
  2. Indeed, the gpt-4 and gpt-4o tokenizers are currently treated as the same, because the 4o tokenizer isn't out yet. This is a known bug in our library: we don't have every tokenizer loaded in yet, so we fall back to cl100k for models tiktoken doesn't recognize. Not perfect, but still better than nothing. Would love some help from folks who know where to get the other tokenizers.

@jasonsu123
Author

Thank you very much for your response. This feature is already great.

Since you mentioned that gpt-4o does not yet have a dedicated tokenizer, could we change the model name to cl100k as follows?
Thank you.

import tiktoken

encoding = tiktoken.encoding_for_model("cl100k")

token_contents = len(encoding.encode(contents))

@areibman
Contributor

> Thank you very much for your response. This feature is already great.
>
> Since you mentioned that gpt-4o does not yet have a dedicated tokenizer, could we change the model name to cl100k as follows? Thank you.
>
> import tiktoken
>
> encoding = tiktoken.encoding_for_model("cl100k")
>
> token_contents = len(encoding.encode(contents))

Can you explain what you mean? We currently use cl100k as a fallback:

logging.warning("Warning: model not found. Using cl100k_base encoding.")
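A minimal sketch of that fallback pattern (the helper name here is just for illustration):

import logging
import tiktoken

def get_encoding_with_fallback(model: str):
    # Try the model-specific tokenizer first; fall back to cl100k_base
    # for models tiktoken does not recognize.
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        logging.warning("Warning: model not found. Using cl100k_base encoding.")
        return tiktoken.get_encoding("cl100k_base")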

@jasonsu123
Author

Yes, I later changed the model name in this code segment to cl100k_base, but I encountered an error when running the program. I'm not sure where the issue in my code is. Thank you.

import tiktoken
encoding = tiktoken.encoding_for_model("cl100k_base")
token_contents = len(encoding.encode(content))
print(f"The prompt contains {token_contents} tokens.")

The error message is:
in encoding_for_model
    raise KeyError(
KeyError: 'Could not automatically map cl100k_base to a tokenizer. Please use `tiktoken.get_encoding` to explicitly get the tokenizer you expect.'
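For reference, cl100k_base is the name of an encoding rather than a model, which is why encoding_for_model raises the KeyError; the error message itself points at the fix. A minimal corrected sketch using tiktoken's get_encoding:

import tiktoken

# cl100k_base is an encoding name, not a model name, so fetch it directly
encoding = tiktoken.get_encoding("cl100k_base")
token_contents = len(encoding.encode(content))
print(f"The prompt contains {token_contents} tokens.")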
