Mikec/02 prep and first evals #82

mikecann · 2025-10-21T03:51:42Z

This is part 2 of 5. The last PR was: #80

This PR's goal is to lay the groundwork for the bulk of the evals changes that are to come in the next PR.

This PR fixes up the running of tests. Right now they don't actually run on main due to some sort of out-of-order issue. To be honest, I'm not entirely sure what's going on; it could be a Windows vs OSX vs Linux issue. With this PR the tests run.

It adds some more helper functions for graders to use, to make the unit tests clearer.

It adds a few evals so I can test these changes. The rest of them will be in the next part.

This PR also makes _write_local_results write much more context into the output file local_results.jsonl. These changes are mainly used in the fourth PR (#84) but are also useful for LLMs to inspect the output of a given run when running locally.
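For readers unfamiliar with the JSONL format, here is a minimal sketch of how per-eval context can be appended to a file like local_results.jsonl and read back. It is not the code in this PR: the helper names `append_result` and `read_results` and the record fields (`eval`, `model`, `score`, `notes`) are assumptions chosen for illustration only.

```python
import json
from pathlib import Path
from typing import Any


def append_result(path: Path, record: dict[str, Any]) -> None:
    """Append one eval result as a single JSON line (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        # One JSON object per line keeps the file easy to append to and to stream.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_results(path: Path) -> list[dict[str, Any]]:
    """Read every non-empty line of a JSONL file back into a list of dicts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    out = Path("local_results.jsonl")
    # The field names below are illustrative assumptions, not the PR's schema.
    append_result(out, {"eval": "example_eval", "model": "example-model",
                        "score": 0.75, "notes": "grader output goes here"})
    for row in read_results(out):
        print(row["eval"], row["score"])
```

Because each line is a complete JSON object, a person or an LLM can inspect a single run's results by reading the file line by line without parsing the whole file at once.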

mikecann added 6 commits October 21, 2025 10:56

from big pr

05f8841

dont need this right now

2ceb5c3

removing binary pass / fail

0562b1a

fixed so it actually runs okay now

ac43e97

removing these, not needed

ec0e3df

no lets use dev for now

49d3e72

This was referenced Oct 21, 2025

Mikec/01 adding gpt 5 #80

Merged

MikeC/03 the rest of the evals upgrades #83

Open
