
@mikecann mikecann commented Oct 21, 2025

This is the third of 5 PRs. The previous one was #82; the next one is #84.

The goal of this PR is to upgrade all the evals and graders so that they test fairly. Here is some text from my closed large PR: #77


Once I was able to run evals locally and view their run logs, I noticed that we were often failing an eval for reasons that I don't think were valid.

The grader was checking against a literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail even though the answer was correct when compared against the task.
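To make the failure mode above concrete, here is a minimal sketch of a literal shape comparison rejecting a reasonable answer. Everything in it is an assumption for illustration: `FunctionSpec`, `strictSpecMatch`, the task wording and the function names are hypothetical and are not the repository's actual `compareFunctionSpec` implementation.

```typescript
// Hypothetical, simplified function spec; not the repository's real types.
type FunctionSpec = { name: string; args: Record<string, string> };

// Literal comparison: the function name, argument names and argument
// types must all match exactly.
function strictSpecMatch(expected: FunctionSpec, actual: FunctionSpec): boolean {
  if (expected.name !== actual.name) return false;
  const expectedKeys = Object.keys(expected.args).sort();
  const actualKeys = Object.keys(actual.args).sort();
  return (
    expectedKeys.length === actualKeys.length &&
    expectedKeys.every(
      (key, i) => key === actualKeys[i] && expected.args[key] === actual.args[key],
    )
  );
}

// Hypothetical task text: "Write a query that returns a user's messages."
// The task names neither the function nor its argument.
const referenceSpec: FunctionSpec = {
  name: "listMessages",
  args: { userId: "string" },
};

// A model could reasonably pick different names for the same behavior.
const modelSpec: FunctionSpec = {
  name: "getMessagesForUser",
  args: { user: "string" },
};

// The strict check rejects the model's answer purely on naming.
const passesStrictCheck = strictSpecMatch(referenceSpec, modelSpec);
```

Under these assumptions the model's answer satisfies the task as written, yet the comparison fails because the names differ from a reference the task never disclosed.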

So the bulk of the work in this PR is to "make the grading fair".

That means going through every single eval and comparing the task to what we grade it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should explicitly tell the AI to generate them with those exact names.

Once I did that, I went through them all again to see if there were more tests we could add, ones that would cover the task but hadn't been written yet. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now potential API surface that we needed to grade against.

There are, however, circumstances where the task specifies something that we are unable to test for using unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the AI grader PR is about: it lets us test those tasks by introducing "AI Grading".
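As a rough sketch of what aligning the task and the test could look like, the example below assumes a task that names the function and its argument, and a behavioral check that calls exactly that name. The task wording, `GeneratedModule`, `gradeListMessages` and the sample data are all hypothetical; the real evals run against generated Convex code rather than a plain object.

```typescript
// Hypothetical task text, now explicit about the names the test relies on:
// 'Write a query named "listMessages" that takes { userId: string }
//  and returns that user's messages.'

type Message = { userId: string; body: string };

// Hypothetical stand-in for the module a model generated.
type GeneratedModule = Record<string, (args: any) => unknown>;

const messages: Message[] = [
  { userId: "alice", body: "hello" },
  { userId: "bob", body: "hi" },
  { userId: "alice", body: "bye" },
];

// Stand-in for model output that follows the names given in the task.
const generated: GeneratedModule = {
  listMessages: ({ userId }: { userId: string }) =>
    messages.filter((m) => m.userId === userId),
};

// Behavioral check: call the function by the name the task specified and
// assert on what it returns, not on its declared shape.
function gradeListMessages(mod: GeneratedModule): boolean {
  const fn = mod["listMessages"];
  if (typeof fn !== "function") return false;
  const result = fn({ userId: "alice" });
  return (
    Array.isArray(result) &&
    result.length === 2 &&
    result.every((m) => (m as Message).userId === "alice")
  );
}

const passed = gradeListMessages(generated);
```

Because the task states the name and argument up front, a failure here points at the model's behavior rather than at a naming choice the task left open.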

@mikecann mikecann changed the title upgraded the rest of the evals MikeC/03 the rest of the evals upgrades Oct 21, 2025