Conversation

@VonEquinox
Contributor

…en counting accuracy

  • Fix HTTP client resource leak by implementing FastAPI lifespan manager
  • Fix security vulnerability in authorization header parsing (case-insensitive)
  • Fix incorrect timestamps using file modification time instead of current time
  • Fix silent JSON parsing failures with proper logging
  • Fix token counting to include tool_calls and system prompts
  • Add comprehensive test script for token counting validation

Resolves:

  • Memory leaks in long-running deployments
  • Potential authorization bypass vulnerability
  • Inaccurate token usage reporting (20-50% underestimation)
  • Debugging difficulties due to silent failures

@funnycups requested a review from Copilot on October 19, 2025 at 11:16

Copilot AI left a comment

Pull Request Overview

Fixes multiple production issues: HTTP client lifecycle leaks, case-insensitive Authorization handling, timestamp correctness, more robust JSON coercion, and significantly improved token counting (including tool_calls). Also adds a script to validate token counting.

  • Implement FastAPI lifespan to initialize/close a shared httpx.AsyncClient (see the sketch after this list)
  • Harden Authorization header parsing and improve token counting for tool_calls and system prompts
  • Replace file mtime timestamps with current time; improve JSON coercion logs; add token counting test script
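
For reference, a minimal sketch of the lifespan pattern described in the first bullet, assuming the shared client is stored on app.state (the attribute name http_client is illustrative, not necessarily the PR's actual wiring):

from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create one shared client at startup and reuse it for every upstream call.
    app.state.http_client = httpx.AsyncClient()
    try:
        yield
    finally:
        # Close the client on shutdown so pooled connections are released.
        await app.state.http_client.aclose()

app = FastAPI(lifespan=lifespan)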

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
test_token_counting.py | Adds a script to exercise token counting across scenarios and models.
main.py | Introduces a lifespan-managed HTTP client, strengthens Authorization parsing, improves token counting (tool_calls and timing), fixes timestamps, and enhances JSON coercion logging.

Comment on lines +1149 to +1163
# Count content tokens
content_tokens = token_counter.count_text_tokens(completion_text, body.model) if completion_text else 0

# Count tool_calls tokens if present
tool_calls_tokens = 0
if completion_tool_calls:
    # Create a temporary message to count tool_calls tokens
    temp_msg = {"role": "assistant", "tool_calls": completion_tool_calls}
    # Count just the tool_calls part
    tool_calls_tokens = token_counter.count_tokens([temp_msg], body.model)
    # Subtract the message overhead to get just tool_calls tokens
    tool_calls_tokens -= 3  # Remove assistant priming overhead
    logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls")

estimated_completion_tokens = content_tokens + tool_calls_tokens

Copilot AI Oct 19, 2025

tool_calls token isolation subtracts only the reply priming (+3) but not the per-message overhead (3 for most models, 4 for gpt-3.5-turbo/gpt-35-turbo). This overestimates tool_calls tokens by 3–4 tokens. Subtract both the per-message overhead and the reply priming, e.g.: compute msg_overhead = 4 if model startswith(('gpt-3.5-turbo','gpt-35-turbo')) else 3, then do tool_calls_tokens -= (msg_overhead + 3). Even better, add a TokenCounter helper to return these overheads to avoid duplicating model-specific logic.

Comment on lines +1367 to +1368
tool_calls_tokens -= 3 # Remove assistant priming overhead
logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls in stream")

Copilot AI Oct 19, 2025

Same issue as non-stream path: only the reply priming (+3) is removed, but the per-message overhead isn't. This inflates tool_calls token counts by 3–4 tokens. Subtract both the per-message overhead (3 for most models, 4 for gpt-3.5-turbo/gpt-35-turbo) and the +3 reply priming.

Suggested change
- tool_calls_tokens -= 3  # Remove assistant priming overhead
- logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls in stream")
+ # Subtract both per-message overhead and reply priming (+3)
+ # Most models: per-message overhead is 3; gpt-3.5-turbo/gpt-35-turbo: 4
+ per_message_overhead = 4 if body.model in ("gpt-3.5-turbo", "gpt-35-turbo") else 3
+ tool_calls_tokens -= (per_message_overhead + 3)
+ logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls in stream (subtracted {per_message_overhead}+3 overhead)")

import os

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

Copilot AI Oct 19, 2025

The comment says 'Add parent directory', but the code adds the script's own directory. Either update the comment or actually add the parent with os.path.dirname(os.path.dirname(os.path.abspath(__file__))).

Suggested change
- sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

Comment on lines +1149 to +1163

Copilot AI Oct 19, 2025

The logic to estimate completion tokens (content + tool_calls with model-specific overhead corrections) is duplicated here and in the streaming path. Consider extracting a single helper (e.g., TokenCounter.count_tool_calls_tokens(tool_calls, model) and a shared estimate_completion_tokens(content, tool_calls, model)) to centralize the model-specific overhead handling and avoid divergence.
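
One possible shape for that helper, reusing the count_tokens call from the snippet above; this is a sketch of the suggestion under those assumptions, not the project's actual implementation:

class TokenCounter:
    # ... existing counting methods ...

    def count_tool_calls_tokens(self, tool_calls, model: str) -> int:
        # Wrap the tool_calls in a throwaway assistant message, count it,
        # then strip both the per-message overhead (3 for most models,
        # 4 for gpt-3.5-turbo/gpt-35-turbo) and the +3 reply priming so
        # only the tool_calls tokens remain.
        temp_msg = {"role": "assistant", "tool_calls": tool_calls}
        total = self.count_tokens([temp_msg], model)
        per_message_overhead = 4 if model.startswith(("gpt-3.5-turbo", "gpt-35-turbo")) else 3
        return max(total - (per_message_overhead + 3), 0)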

Comment on lines +957 to +966
# Check authorization header format (case-insensitive)
if not authorization.lower().startswith("bearer "):
    raise HTTPException(status_code=401, detail="Invalid authorization header format")

# Extract the token (everything after "Bearer ")
client_key = authorization[7:].strip()

# Ensure the key is not empty
if not client_key:
    raise HTTPException(status_code=401, detail="Missing API key")

Copilot AI Oct 19, 2025

Parsing Authorization by slicing index 7 assumes a single space and no extra whitespace. Use robust parsing to handle multiple spaces/tabs: scheme, _, token = authorization.partition(' '); if scheme.lower() != 'bearer' or not token.strip(): raise HTTPException(...). This avoids edge cases and keeps the check case-insensitive.
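
A minimal sketch of that partition-based check, reusing the names from the snippet above:

scheme, _, token = authorization.partition(" ")
if scheme.lower() != "bearer":
    raise HTTPException(status_code=401, detail="Invalid authorization header format")

# strip() tolerates extra whitespace between the scheme and the token
client_key = token.strip()
if not client_key:
    raise HTTPException(status_code=401, detail="Missing API key")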

@funnycups
Owner

Thank you for your contribution! After a careful review, I would suggest the following changes:

  1. Remove the unnecessary test_token_counting.py file.
  2. Counting tokens based on the actual prompt and the model's generated XML/text instructions is more accurate. This approach is already implemented in the codebase, so the estimation logic for tool calls can now be removed.
  3. I saw your contribution to unify some mixed usage of bytes and str. However, there's still a mix of str and bytes in the streaming SSE output. I would suggest using bytes only in all SSE yields (see the sketch below).

Thanks again for your great work on Toolify!
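
As an illustration of point 3, a minimal sketch of a bytes-only SSE generator (the function name and chunk shape here are hypothetical, not Toolify's actual code):

import json

async def stream_sse(chunks):
    # Encode every frame up front so the generator never mixes str and bytes.
    async for chunk in chunks:
        payload = json.dumps(chunk, ensure_ascii=False)
        yield f"data: {payload}\n\n".encode("utf-8")
    yield b"data: [DONE]\n\n"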
