This PR is based upon this one:
#77
As mentioned in the above, there is only so far we can go with unit-test-based grading. There are lots of places where a task involves something that isn't directly testable.
For example, 001/008 asks the AI to use "helper functions" that it then calls from its queries. This is not verifiable via unit tests.
That's what this PR is about: it allows us to test those tasks by introducing "AI Grading".
It feeds a model the task and the generated output, and asks it for a pass/fail verdict plus a couple of sentences explaining its reasoning.
This works very well in all my testing thus far.
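For illustration, here's a minimal sketch of what the grader could look like. The function name `gradeWithAI`, the `GradeResult` shape, and the exact prompt are hypothetical, not the actual identifiers in this PR; only the model (`gpt-5-mini`) and the pass/fail-plus-reasoning contract come from the description above:

```ts
// Hypothetical grader sketch — gradeWithAI and GradeResult are
// illustrative names, not the PR's actual implementation.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface GradeResult {
  pass: boolean;
  reasoning: string;
}

export async function gradeWithAI(
  task: string,
  output: string
): Promise<GradeResult> {
  const response = await client.chat.completions.create({
    model: "gpt-5-mini", // kept cheap for now; any model would work
    messages: [
      {
        role: "system",
        content:
          'You are a strict grader. Reply with JSON only: {"pass": boolean, "reasoning": string}.',
      },
      {
        role: "user",
        content: `Task:\n${task}\n\nGenerated output:\n${output}\n\nDoes the output satisfy the task?`,
      },
    ],
  });
  // Parse the model's verdict and its short explanation.
  return JSON.parse(response.choices[0].message.content ?? "{}") as GradeResult;
}
```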
I have it set to use `gpt-5-mini` for now to keep costs low, but it could use any model. You invoke it simply as a unit test in a `grader.test.ts` file, so if the grading fails then the unit test fails. It logs the reasoning.

I then asked an AI to go through all the tasks, work out which are not covered entirely by the grader tests, and add the AI-based grading to those too. I asked it to give its reasoning as it went, and I smoke-tested a few of the results and they seem logical.
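For reference, an illustrative `grader.test.ts` showing the invocation. The test runner (Vitest here), the file paths, and the import of the sketched `gradeWithAI` helper are all assumptions, not necessarily how this PR lays things out:

```ts
// Illustrative grader.test.ts — assumes Vitest and the hypothetical
// gradeWithAI sketch above; real paths and layout may differ.
import { describe, expect, it } from "vitest";
import { readFileSync } from "node:fs";
import { gradeWithAI } from "./gradeWithAI";

describe("001/008 helper functions", () => {
  it("passes AI grading", async () => {
    const task = readFileSync("tasks/001/008/task.md", "utf8"); // hypothetical path
    const output = readFileSync("tasks/001/008/output.ts", "utf8"); // hypothetical path
    const { pass, reasoning } = await gradeWithAI(task, output);
    console.log(reasoning); // surfaces the grader's explanation in the test log
    expect(pass).toBe(true); // a failing grade fails the unit test
  });
});
```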