Improved Score Calculation with LLMs #141

GODrums · 2024-11-09T22:58:58Z

Objective

We noticed that our current score calculation for the leaderboard does not adequately account for the effort invested in a review. For example, manual tests on Artemis servers take quite a long time, yet they receive the same or a lower score than a simple code review with a few nit-picks.

A fair effort evaluation often requires decent Natural Language Processing (NLP) and a well-trained LLM. As estimates with a reasonable degree of accuracy are sufficient for our use case of score calculation, we aim to approximate this evaluation with prompt engineering on general-purpose LLMs, like GPT-4o.

General idea

First: classification test - decide whether the review was manual testing, a code review, etc.
Next: LLM complexity assessment - context-based scoring. Provide context in the prompt, e.g. the PR, and ask the LLM for an evaluation.
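
The two steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the planned implementation: the category names, the prompt wording, the 1 to 10 effort scale, and all function names are hypothetical, and `ask_llm` stands in for whatever client ends up calling the model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical categories; the issue only names manual testing and code review.
CATEGORIES = ("manual_testing", "code_review", "other")


@dataclass
class Review:
    body: str
    pr_title: str        # PR context passed to the LLM (assumed fields)
    pr_description: str


def build_classification_prompt(review: Review) -> str:
    return (
        "Classify the following pull request review as one of: "
        + ", ".join(CATEGORIES)
        + ". Answer with the category name only.\n\n"
        + f"Review:\n{review.body}"
    )


def build_complexity_prompt(review: Review, category: str) -> str:
    return (
        f"The following review was classified as '{category}'.\n"
        f"Pull request title: {review.pr_title}\n"
        f"Pull request description: {review.pr_description}\n"
        f"Review:\n{review.body}\n\n"
        "Rate the effort invested in this review as an integer from 1 to 10. "
        "Answer with the number only."
    )


def classify(review: Review, ask_llm: Callable[[str], str]) -> str:
    """Step 1: map the review to a category, falling back to 'other'."""
    answer = ask_llm(build_classification_prompt(review)).strip().lower()
    return answer if answer in CATEGORIES else "other"


def assess_effort(
    review: Review, category: str, ask_llm: Callable[[str], str]
) -> Optional[int]:
    """Step 2: ask for an effort rating; None if the answer is not a number."""
    answer = ask_llm(build_complexity_prompt(review, category)).strip()
    try:
        value = int(answer)
    except ValueError:
        return None
    return min(10, max(1, value))  # clamp to the assumed 1..10 scale
```

Passing `ask_llm` in as a plain callable keeps the prompt logic testable without a network call, which should help while iterating on the prompts.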

Tasks

Prompt engineering
- Endpoint for score evaluation via LLM request
- Reviews as prompt
- Provide prompt context for LLM
LLM in score calculation
- Request evaluation from LLM endpoint
- Store LLM score/response in review model
- Switch from old scoring algorithm to new one in prod
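
One possible shape for the last two tasks, again just a sketch under assumptions: `ReviewRecord`, its fields, and the `use_llm_scoring` switch are hypothetical and do not reflect the actual review model. The idea is to keep the raw LLM response next to the parsed score and to fall back to the old score whenever no usable LLM score exists.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewRecord:
    # Hypothetical stand-in for the review model.
    legacy_score: float                 # result of the old scoring algorithm
    llm_score: Optional[float] = None   # parsed score from the LLM, if any
    llm_response: Optional[str] = None  # raw response kept for debugging


def store_llm_evaluation(record: ReviewRecord, raw_response: str) -> None:
    """Keep the raw response and parse a numeric score from it if possible."""
    record.llm_response = raw_response
    try:
        value = float(raw_response.strip())
    except ValueError:
        record.llm_score = None
        return
    # Reject "nan" and "inf", which float() would otherwise accept.
    record.llm_score = value if math.isfinite(value) else None


def effective_score(record: ReviewRecord, use_llm_scoring: bool) -> float:
    """Use the LLM score when enabled and available, else the old score."""
    if use_llm_scoring and record.llm_score is not None:
        return record.llm_score
    return record.legacy_score
```

With a switch like this, the new scoring could be enabled in prod without losing the old score as a fallback for reviews that have not been evaluated yet.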

GODrums self-assigned this Nov 9, 2024

GODrums added the enhancement, application-server, intelligence-service, and research labels Nov 9, 2024
