MikeC/03 the rest of the evals upgrades #83
This is the third of 5 PRs. The previous one was #82 and the next one is #84.
The goal of this PR is to upgrade all the evals and graders so that they test fairly. Here is some text from my closed large PR, #77:
Once I was able to run the evals locally and view their run logs, I noticed that we were often failing an eval for reasons that I don't think were valid.
The grader was checking the literal schema and function shape, but the task did not specify the expected shape. That left wiggle room in how the task could be interpreted, so different models would output different code and fail the eval even though the answer was correct when compared against the task.
So the bulk of the work in this PR is to "make the grading fair".
That means going through every single eval and comparing the task to what we grade it on in the grader.test.ts files. If the tests expect certain functions with certain arguments, then the task should explicitly tell the AI to generate functions with those exact names.
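As a rough sketch of what that looks like (assuming a vitest-style grader.test.ts; the function name and file path here are made up, not from the actual evals), once the task names the expected export explicitly the grader can test its behaviour directly:

```ts
import { expect, test } from "vitest";

// Hypothetical example: the task explicitly asks for an exported function
// named `formatUserName(first, last)`, so the grader can import it by that
// exact name and assert on observable behaviour rather than literal shape.
import { formatUserName } from "./answer"; // path and name are assumptions

test("formatUserName joins first and last names with a space", () => {
  expect(formatUserName("Ada", "Lovelace")).toBe("Ada Lovelace");
});
```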
Once I did that, I went through them all again and looked for additional tests we could add to cover parts of the task that hadn't yet been written. This was important because I removed the too-strict compareFunctionSpec and compareSchema from all the evals, so there was now API surface that we needed to grade against with explicit tests.
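To show the direction of that change, here is a hedged sketch (the exports and harness are hypothetical, not the real eval code) of a shape comparison being replaced by a behavioural test over the same API surface:

```ts
import { expect, test } from "vitest";

// Before: the grader compared the generated code's shape to a reference,
// e.g. expect(compareFunctionSpec(generated, reference)).toBe(true);
// After: the grader exercises the API surface the task actually asks for.
import { addItem, listItems } from "./answer"; // hypothetical exports

test("items added via addItem are returned by listItems", async () => {
  await addItem({ name: "apples" });
  const items = await listItems();
  expect(items.map((item) => item.name)).toContain("apples");
});
```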
There are, however, circumstances where the task specifies something that we are unable to test for with unit tests. For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not testable via unit testing, so that's what the AI grader PR is about: it lets us test those tasks by introducing "AI Grading".