Objective
We noticed that our current leaderboard score calculation does not appropriately account for the effort invested in a review. For example, manual tests on Artemis servers take quite a long time, yet they receive the same or a lower score than a simple code review with a few nitpicks.
A fair effort evaluation often requires decent Natural Language Processing (NLP) and a well-trained LLM. Since estimates with a certain degree of accuracy are sufficient for our score calculation use case, we aim to approximate this evaluation with prompt engineering on general-purpose LLMs such as GPT-4o.
General idea
First: classification step. Decide whether the review was manual testing, code review, etc.
Next: LLM complexity assessment, i.e. context-based scoring. Provide context (e.g. the PR) in the prompt and ask the LLM to evaluate the review.
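The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the `Review` fields, the label set in `REVIEW_TYPES`, the 0 to 10 score range, and the prompt wording are all hypothetical and not taken from the existing code base. The LLM call is passed in as a plain function (`ask_llm`), so any client (e.g. one backed by GPT-4o) can be plugged in.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical label set; the real categories are still to be decided.
REVIEW_TYPES = ("manual_testing", "code_review", "other")


@dataclass
class Review:
    # Hypothetical fields used as prompt context.
    body: str
    pr_title: str
    pr_description: str


def classification_prompt(review: Review) -> str:
    """Step 1: ask the LLM which kind of review this is."""
    labels = ", ".join(REVIEW_TYPES)
    return (
        f"Classify the following review as one of: {labels}.\n"
        "Answer with the label only.\n\n"
        f"Review:\n{review.body}"
    )


def complexity_prompt(review: Review, review_type: str) -> str:
    """Step 2: ask the LLM for an effort score, given the PR as context."""
    return (
        f"The following review was classified as '{review_type}'.\n"
        "Rate the effort invested on a scale from 0 to 10. "
        "Answer with a single integer.\n\n"
        f"PR title: {review.pr_title}\n"
        f"PR description: {review.pr_description}\n"
        f"Review:\n{review.body}"
    )


def parse_label(answer: str) -> str:
    """Fall back to 'other' if the answer is not a known label."""
    cleaned = answer.strip().lower()
    return cleaned if cleaned in REVIEW_TYPES else "other"


def parse_score(answer: str) -> int:
    """Extract the first integer from the answer and clamp it to 0..10."""
    match = re.search(r"\d+", answer)
    if match is None:
        return 0
    return max(0, min(10, int(match.group())))


def estimate_effort(review: Review, ask_llm: Callable[[str], str]) -> Tuple[str, int]:
    """Run classification first, then context-based scoring."""
    review_type = parse_label(ask_llm(classification_prompt(review)))
    score = parse_score(ask_llm(complexity_prompt(review, review_type)))
    return review_type, score
```

Parsing the answers defensively (unknown label falls back to `other`, score is clamped) keeps a malformed LLM response from breaking the leaderboard calculation.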
Tasks
Prompt engineering
Endpoint for score evaluation via LLM request
Reviews as prompt
Provide prompt context for LLM
LLM in score calculation
Request evaluation from LLM endpoint
Store LLM score/response in review model
Switch from old scoring algorithm to new one in prod
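As a rough sketch of the last three tasks, the LLM result could be stored on the review and the leaderboard could switch between algorithms behind a flag. Everything here is an assumption for illustration: `ReviewRecord`, its fields, the `use_llm_scoring` flag, and in particular `legacy_score`, which is only a placeholder and not the current scoring algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewRecord:
    # Hypothetical review model; field names are assumptions.
    comment_count: int
    llm_score: Optional[int] = None
    llm_response: Optional[str] = None


def legacy_score(review: ReviewRecord) -> int:
    """Placeholder for the old algorithm (NOT the real formula)."""
    return review.comment_count


def store_llm_result(review: ReviewRecord, score: int, raw_response: str) -> None:
    """Keep both the parsed score and the raw response for later auditing."""
    review.llm_score = score
    review.llm_response = raw_response


def leaderboard_score(review: ReviewRecord, use_llm_scoring: bool) -> int:
    """Use the LLM score when enabled and available, else the old score."""
    if use_llm_scoring and review.llm_score is not None:
        return review.llm_score
    return legacy_score(review)
```

Falling back to the old score when no LLM result is stored would allow the switch in prod to happen gradually, before every existing review has been evaluated.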