MikeC/03 the rest of the evals upgrades #83
This is the third of 5 PRs. The previous one was #82 and the next one is #84.
The goal of this PR is to upgrade all the evals and graders so that they test fairly. Here is some text from my closed large PR, #77:
Once I was able to run the evals locally and view their run logs, I noticed that we were often failing an eval for reasons that I don't think were valid.
The grader was checking the literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail the eval even though the answer was correct when compared against the task.
So the bulk of the work in this PR is to "make the grading fair".
That means going through every single eval and comparing the task to what we grade it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should explicitly tell the AI to generate functions with those exact names.
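As a rough sketch of what that looks like (assuming a vitest-style grader.test.ts; the function name and file path here are made up, not from the actual evals), once the task names the expected export explicitly the grader can test its behaviour directly:

```ts
import { expect, test } from "vitest";

// Hypothetical example: the task explicitly asks for an exported function
// named `formatUserName(first, last)`, so the grader can import it by that
// exact name and assert on observable behaviour rather than literal shape.
import { formatUserName } from "./answer"; // path and name are assumptions

test("formatUserName joins first and last names with a space", () => {
  expect(formatUserName("Ada", "Lovelace")).toBe("Ada Lovelace");
});
```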
Once I did that, I went through them all again and looked for additional tests we could add to cover parts of the task that hadn't yet been written. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now API surface that we needed to grade against with explicit tests.
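To show the direction of that change, here is a hedged sketch (the exports and harness are hypothetical, not the real eval code) of a shape comparison being replaced by a behavioural test over the same API surface:

```ts
import { expect, test } from "vitest";

// Before: the grader compared the generated code's shape to a reference,
// e.g. expect(compareFunctionSpec(generated, reference)).toBe(true);
// After: the grader exercises the API surface the task actually asks for.
import { addItem, listItems } from "./answer"; // hypothetical exports

test("items added via addItem are returned by listItems", async () => {
  await addItem({ name: "apples" });
  const items = await listItems();
  expect(items.map((item) => item.name)).toContain("apples");
});
```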
There are, however, circumstances where the task specifies something that we are unable to test for with unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the AI grader PR is about: it lets us test those tasks by introducing "AI Grading".