Mikec/02 prep and first evals #82
Open
This is part 2 of 5. The previous PR was #80.
This PR's goal is to lay the groundwork for the bulk of the evals changes coming in the next PR.
This PR fixes the running of the tests, which currently don't run on main because of some sort of ordering issue. To be honest, I'm not entirely sure what's going on; it could be a Windows vs. OSX vs. Linux difference. With this PR the tests run.
It adds some more helper functions for graders to use, to make unit tests clearer.
It also adds a few evals so I can test these changes. The rest of them will be in the next part.
This PR also adds much more context in `_write_local_results` to the output file `local_results.jsonl`. These changes are mainly used in the fourth PR (#84) but are also useful for LLMs to inspect the output of a given run when running locally.
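For reviewers unfamiliar with the format, here is a minimal sketch of how a JSONL results file with extra per-record context can be written and read back. This is not the actual `_write_local_results` implementation; the function names and the record fields (`eval_name`, `score`, `context`) are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Any


def write_results_jsonl(records: list[dict[str, Any]], path: Path) -> None:
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_results_jsonl(path: Path) -> list[dict[str, Any]]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record shape: the real fields written by this PR may differ.
example_records = [
    {"eval_name": "example_eval", "score": 1.0, "context": {"note": "passed"}},
    {"eval_name": "other_eval", "score": 0.0, "context": {"note": "failed"}},
]
```

Because each line is a standalone JSON object, a person or an LLM can inspect a single run's output line by line without parsing the whole file.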