@github-actions github-actions bot commented Oct 7, 2025

This PR adds benchmark results for the qwen/qwen2.5-coder-7b-instruct model.

The following files have been updated:

  • src/benchmark/results.json - Raw benchmark results
  • src/benchmark/validation-results.json - Validation results against human baseline

This PR was automatically generated by the benchmark workflow.

Note: If you don't want to merge this PR, close it and the model will be added to the untested list to prevent re-processing.

@alrocar

vercel bot commented Oct 7, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| llm-benchmark | Ready | Preview | Comment | Oct 7, 2025 6:14am |

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot

"details": "Results don't match",
"humanRowCount": 1,
"llmRowCount": 0,
"sql": "I previously asked: \"I previously asked: \"Count all stars\"\n\nYou generated this SQL query:\n\nCount all stars for a given repository\nSelect all stars for a given repository\nSelect all stars for a given repository and user\nSelect all stars for a given repository and user and date range\nSelect all stars for a given repository and user and date range and order by created_at\nSelect all stars for a given repository and user and date range and order by created_at and limit 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as 
total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by 
actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 1\n\n\nBut it resulted in this error:\n\n{\"error\": \"DB::Exception: Syntax error: failed at position 1 ('Count') (line 1, col 1): Count all stars for a given repository\\nSelect all stars for a given repository\\nSelect all stars for a given repository and user\\nSelect all stars for a given rep. 
Expected one of: Query, Query with output, EXPLAIN, EXPLAIN, SELECT query, possibly with UNION, list of union elements, SELECT query, subquery, possibly with UNION, SELECT subquery, SELECT query, WITH, FROM, SELECT, SHOW CREATE QUOTA query, SHOW C\n\n\nPlease fix the SQL query to correctly answer my original question. Make sure the SQL is valid for Tinybird/ClickHouse.\"\n\nYou generated this SQL query:\n\nI previously asked: \"Count all stars\"\n\nYou generated this SQL query:\n\nCount all stars for a given repository\nSelect all stars for a given repository\nSelect all stars for a given repository and user\nSelect all stars for a given repository and user and date range\nSelect all stars for a given repository and user and date range and order by created_at\nSelect all stars for a given repository and user and date range and order by created_at and limit 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and 
offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars\nSelect all stars for a given repository and user and date range and order by created_at and limit 10 and offset 10 and group by actor_login and count(*) as stars and order by stars desc and limit 10 and offset 10 and sum(stars) as total_stars and sum(stars) as total_stars\nSelect all stars for a given repository and user and"

Bug: Benchmark Results Contain Unfiltered SQL Output

The sql field in the benchmark results contains raw, uncleaned model output instead of just the generated SQL query: it embeds the retry prompt, the earlier failed attempts, and the ClickHouse error text. The stored value is not valid SQL, so this entry in the results file cannot be used for validation.
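
One way to guard against this is to check the model reply before it is written to the results file. The following is a minimal sketch, not the benchmark's actual code: `extract_sql` is a hypothetical helper, and it assumes that valid answers are a single statement starting with SELECT or WITH, optionally wrapped in a Markdown code fence.

```python
import re
from typing import Optional

# Assumption: benchmark answers are read-only queries starting with SELECT or WITH.
SQL_START = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)
# Optional Markdown fence around the query, with or without an "sql" language tag.
FENCE = re.compile(r"```(?:sql)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_sql(raw_output: str) -> Optional[str]:
    """Return the SQL statement found in a model reply, or None if there is none.

    Hypothetical helper: the name and the rules are illustrative only.
    """
    fenced = FENCE.search(raw_output)
    # Prefer the fenced content when present; otherwise inspect the whole reply.
    candidate = (fenced.group(1) if fenced else raw_output).strip()
    # Reject replies that start with prose, such as an echoed prompt.
    return candidate if SQL_START.match(candidate) else None
```

With this check, a reply that begins with "I previously asked:" or "Count all stars for a given repository" yields None, so the caller could record a failed generation instead of storing the raw text in the sql field.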

  "qwen3-coder-plus",
- "qwen3-coder-flash"
+ "qwen3-coder-flash",
+ "qwen2.5-coder-7b-instruct"

Bug: Model Name Prefix Mismatch

The model was added to the config as qwen2.5-coder-7b-instruct, without the qwen/ prefix. The PR description and the benchmark results refer to it as qwen/qwen2.5-coder-7b-instruct, so the mismatch may prevent the system from matching the config entry to its results.
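
The neighbouring config entries in the excerpt above (qwen3-coder-plus, qwen3-coder-flash) are also unprefixed, so one option is to compare models on the bare name rather than on the full identifier. This is a minimal sketch under that assumption; `bare_model_name` and `same_model` are hypothetical names, not functions from the benchmark.

```python
def bare_model_name(model_id: str) -> str:
    """Drop an optional 'provider/' prefix, e.g. 'qwen/foo' -> 'foo'."""
    # Split on the first slash only; IDs without a slash are returned unchanged.
    return model_id.split("/", 1)[-1]


def same_model(a: str, b: str) -> bool:
    """Treat two identifiers as the same model if their bare names match."""
    return bare_model_name(a) == bare_model_name(b)
```

Whether the lookup should tolerate a missing prefix or the config should store the full identifier is a design choice for the maintainers; the sketch only shows the tolerant variant.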
