
Conversation

@mikecann
Contributor

This PR is based upon this one:
#77

As mentioned there, unit-test-based grading only goes so far. In plenty of tasks, the thing being asked for isn't directly testable.

For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. That requirement can't be verified with a unit test.

That's what this PR is about: it lets us test those tasks by introducing "AI Grading".

It feeds a model the task and the generated output and asks it to return a pass/fail verdict plus a couple of sentences explaining its reasoning.
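As a rough sketch of what that grading call could look like (the helper name `aiGrade`, the prompt wording, and the use of the OpenAI SDK here are illustrative assumptions, not necessarily what this PR implements):

```ts
// Hypothetical AI-grading helper; the PR's actual implementation may differ.
import OpenAI from "openai";

const client = new OpenAI();

export interface GradeResult {
  pass: boolean;
  reasoning: string;
}

// Sends the task description and the generated output to the model and asks
// for a pass/fail verdict plus a short explanation.
export async function aiGrade(task: string, output: string): Promise<GradeResult> {
  const response = await client.chat.completions.create({
    model: "gpt-5-mini", // cheap model for grading; any model could be swapped in
    messages: [
      {
        role: "system",
        content:
          "You are a grader. Given a task and a candidate solution, reply with " +
          'JSON of the form {"pass": boolean, "reasoning": string}.',
      },
      { role: "user", content: `Task:\n${task}\n\nGenerated output:\n${output}` },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as GradeResult;
}
```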

This works very well in all my testing thus far.

I have it set to use gpt-5-mini for now to keep costs low, but it could use any model.

You invoke it simply as a unit test in a grader.test.ts file, so if the grading fails then the unit test fails. It logs the model's reasoning.
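For example, a grader.test.ts might look something like this (the test runner shown here is Vitest, and the file paths and helper name are assumptions for illustration):

```ts
// grader.test.ts — hypothetical sketch of invoking AI grading as a unit test.
import { test, expect } from "vitest";
import { readFile } from "node:fs/promises";
import { aiGrade } from "./aiGrade";

test("generated code follows the task's helper-function requirement", async () => {
  const task = await readFile("./TASK.txt", "utf8");
  const output = await readFile("./output/index.ts", "utf8");

  const { pass, reasoning } = await aiGrade(task, output);

  // Log the model's reasoning so a failure is explainable from the test output.
  console.log(reasoning);

  expect(pass, reasoning).toBe(true);
});
```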

I then asked an AI to go through all the tasks, work out which aren't fully covered by the existing grader tests, and add AI-based grading to those. I asked it to explain its reasoning as it went, and I smoke-tested a few of the results; they look sensible.

@mikecann mikecann requested a review from jordanhunt22 August 27, 2025 02:23
@mikecann mikecann mentioned this pull request Oct 21, 2025
@mikecann
Contributor Author

Closing in favour of #85, which builds on smaller chunks.

@mikecann mikecann closed this Oct 21, 2025