WIP: Bring evaluation more tightly into the repo #2233
base: main
Conversation
```bash
azd env set USE_EVAL true
```

2. Set the capacity to the highest possible value to ensure that the evaluation runs quickly.
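For the capacity step, deployment capacity in this repo is typically controlled through azd environment variables before provisioning; a sketch (the variable name and value here are assumptions — check the repo's infra parameters for the actual one):

```shell
# Assumed variable name; verify against infra/main.parameters.json
azd env set AZURE_OPENAI_CHATGPT_DEPLOYMENT_CAPACITY 100
azd provision
```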
should we state in the documentation that evals are expected to take a long time to run?
Yep sounds good, can do
Install all the dependencies for the evaluation script by running the following command:

```bash
pip install -r evals/requirements.txt
```
Why do we create a 2nd environment? Because we don't need it in production?
The requirements can be a bit hairy to install, so I don't want to put them in the main requirements. You don't need a second venv, though.
I won't merge this until the eval SDK removes the promptflow dependency, and that'll make the install even smoother.
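If you do want to keep the eval dependencies isolated anyway, a separate virtual environment is one option (a sketch; the directory name `.evalenv` is arbitrary):

```shell
# Create and activate a dedicated venv for the eval tooling
python -m venv .evalenv
source .evalenv/bin/activate   # on Windows: .evalenv\Scripts\activate
pip install -r evals/requirements.txt
```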
```python
body = {
    "messages": [{"content": query, "role": "user"}],
    "stream": stream,
    "context": {
```
Does this need to be read from config?
Great question! This is what is very awkward about how our app currently works: the frontend is the source of truth for the parameter values. If you don't send anything to the API, then you may very well end up with a set of parameters that do not match what your app uses.
Ideally, the backend would have its own default set of values, send those to the frontend, and the frontend settings UI would update based on those. Then you could send a bare request to /chat and have it inherit the right default values.
For now, you have to copy/paste what gets sent over the wire from the frontend, which is awkward.
Oh, but also: I'm going to remove this file entirely, as I didn't find this to be a useful way to generate the ground truth. RAGAS is easier.
This discussion is still relevant for the evaluation flow though, as you currently have to copy/paste parameters into evaluate_config.json
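To make the copy/paste workflow concrete, here is a sketch of a complete request body with the `context`/`overrides` shape filled in; the specific override names and values below are illustrative assumptions, not this repo's authoritative defaults — copy the real ones from your browser's network tab:

```python
# Sketch of a /chat request body. The override names below are examples
# of the kind of parameters the frontend sends; replace them with the
# actual payload captured from the frontend over the wire.
query = "What does a Product Manager do?"
stream = False

body = {
    "messages": [{"content": query, "role": "user"}],
    "stream": stream,
    "context": {
        "overrides": {
            "top": 3,                    # assumed parameter
            "temperature": 0.3,          # assumed parameter
            "retrieval_mode": "hybrid",  # assumed parameter
        }
    },
}

# e.g. requests.post(f"{backend_uri}/chat", json=body)
```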
@@ -0,0 +1,24 @@
```json
{"question": "Could you provide details on the insurance coverage Contoso offers for work-related injuries? Additionally, I'd like to know more about the overall benefits package available to employees.", "truth": "Contoso offers Workers' Compensation Insurance coverage through Northwind Health for work-related injuries or illnesses [Northwind_Standard_Benefits_Details.pdf#page=101][Northwind_Health_Plus_Benefits_Details.pdf#page=106]. This coverage includes medical care, wage replacement, vocational rehabilitation, and death benefits [Northwind_Standard_Benefits_Details.pdf#page=101]. Employees should report any injuries or illnesses to their supervisor as soon as possible and ensure that the appropriate paperwork is on file to be eligible for Workers' Compensation Insurance coverage [Northwind_Health_Plus_Benefits_Details.pdf#page=107]. As for the overall benefits package, please provide more specific details or refer to the employee handbook."}
```
How did you generate this?
This looks like the ground truth generated from the azure-ai-eval simulator. I didn't think this was as useful, so as long as RAGAS looks good, I'll remove this file and the non-RAGAS generation.
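Whichever generator ends up being used, the ground truth format above is one JSON object per line (JSONL); a small sketch of loading it for evaluation (the file path is an assumption):

```python
import json

def load_ground_truth(path):
    """Load question/truth pairs from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Each entry has "question" and "truth" keys, e.g.:
# pairs = load_ground_truth("evals/ground_truth.jsonl")  # path assumed
```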
```
dotenv-azd==0.2.0
azure-ai-evaluation==1.0.1
rich
git+https://github.com/Azure-Samples/ai-rag-chat-evaluator
```
TODO: Add a version/branch
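For that TODO, pip supports pinning a git requirement to a tag, branch, or commit with an `@` suffix (the ref below is a placeholder, not a real tag in that repo):

```
git+https://github.com/Azure-Samples/ai-rag-chat-evaluator@<tag-or-commit>
```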
```python
simulations = generate_config["simulations"]
num_per_task = generate_config.get("num_per_task", 2)

asyncio.run(generate_ground_truth(azure_credential, simulations, num_per_task))
```
TODO: gist
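The `generate_config` read by the snippet above might look like this; only `simulations` and `num_per_task` appear in the code, so the shape of each simulation entry here is an assumption:

```json
{
  "simulations": [
    {"task": "Ask about employee benefits", "persona": "new employee"}
  ],
  "num_per_task": 2
}
```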
Purpose
This PR makes it easier to run evaluations by bringing the evaluation SDK and tools directly into the repo. These scripts still use the ai-rag-chat-evaluator for its custom evaluation metrics and evaluation review CLI tools, but I've moved the ground truth generation directly into the repo, as I've found that it is often very specific to the needs of the repo.
This PR currently uses the new Simulator from azure-ai-evaluation, but I might try RAGAS as well. It's hard to get good ground truth data that sufficiently covers the whole knowledge base.
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- I ran `python -m pytest` (all tests pass).
- I ran `python -m pytest --cov` to verify 100% coverage of added lines.
- I ran `python -m mypy` to check for type errors.
- I ran `ruff` and `black` manually on my code.