WIP: Bring evaluation more tightly into the repo #2233
base: main
Conversation
```bash
azd env set USE_EVAL true
```

2. Set the capacity to the highest possible value to ensure that the evaluation runs quickly.
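For the capacity step, deployment capacity in this repo is typically controlled through azd environment variables before provisioning; a sketch (the variable name and value here are assumptions — check the repo's infra parameters for the actual one):

```shell
# Assumed variable name; verify against infra/main.parameters.json
azd env set AZURE_OPENAI_CHATGPT_DEPLOYMENT_CAPACITY 100
azd provision
```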
should we state in the documentation that evals are expected to take a long time to run?
Yep sounds good, can do
Install all the dependencies for the evaluation script by running the following command:

```bash
pip install -r evals/requirements.txt
```
Why do we create a 2nd environment? Because we don't need it in production?
The requirements can be a bit hairy to install, so I don't want to put them in the main requirements. You don't need a second venv, though.
I won't merge this until the eval SDK removes the promptflow dependency, and that'll make the install even smoother.
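If you do want to keep the eval dependencies isolated anyway, a separate virtual environment is one option (a sketch; the directory name `.evalenv` is arbitrary):

```shell
# Create and activate a dedicated venv for the eval tooling
python -m venv .evalenv
source .evalenv/bin/activate   # on Windows: .evalenv\Scripts\activate
pip install -r evals/requirements.txt
```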
```python
body = {
    "messages": [{"content": query, "role": "user"}],
    "stream": stream,
    "context": {
```
Does this need to be read from config?
Great question! This is what is very awkward about how our app currently works: the frontend is the source of truth for the parameter values. If you don't send anything to the API, then you may very well end up with a set of parameters that do not match what your app uses.
Ideally, the backend would have its own default set of values, send those to the frontend, and the frontend settings UI would update based on those. Then you could send a bare request to /chat and have it inherit the right default values.
For now, you have to copy/paste what gets sent over the wire from the frontend, which is awkward.
Oh, but also: I'm going to remove this file entirely, as I didn't find this to be a useful way to generate the ground truth. RAGAS is easier.
This discussion is still relevant for the evaluation flow though, as you currently have to copy/paste parameters into evaluate_config.json
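To make the copy/paste workflow concrete, here is a sketch of a complete request body with the `context`/`overrides` shape filled in; the specific override names and values below are illustrative assumptions, not this repo's authoritative defaults — copy the real ones from your browser's network tab:

```python
# Sketch of a /chat request body. The override names below are examples
# of the kind of parameters the frontend sends; replace them with the
# actual payload captured from the frontend over the wire.
query = "What does a Product Manager do?"
stream = False

body = {
    "messages": [{"content": query, "role": "user"}],
    "stream": stream,
    "context": {
        "overrides": {
            "top": 3,                    # assumed parameter
            "temperature": 0.3,          # assumed parameter
            "retrieval_mode": "hybrid",  # assumed parameter
        }
    },
}

# e.g. requests.post(f"{backend_uri}/chat", json=body)
```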
@@ -0,0 +1,24 @@
```json
{"question": "Could you provide details on the insurance coverage Contoso offers for work-related injuries? Additionally, I'd like to know more about the overall benefits package available to employees.", "truth": "Contoso offers Workers' Compensation Insurance coverage through Northwind Health for work-related injuries or illnesses [Northwind_Standard_Benefits_Details.pdf#page=101][Northwind_Health_Plus_Benefits_Details.pdf#page=106]. This coverage includes medical care, wage replacement, vocational rehabilitation, and death benefits [Northwind_Standard_Benefits_Details.pdf#page=101]. Employees should report any injuries or illnesses to their supervisor as soon as possible and ensure that the appropriate paperwork is on file to be eligible for Workers' Compensation Insurance coverage [Northwind_Health_Plus_Benefits_Details.pdf#page=107]. As for the overall benefits package, please provide more specific details or refer to the employee handbook."}
```
How did you generate this?
This looks like the ground truth generated from the azure-ai-eval simulator. I didn't think this was as useful, so as long as RAGAS looks good, I'll remove this file and the non-RAGAS generation.
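Whichever generator ends up being used, the ground truth format above is one JSON object per line (JSONL); a small sketch of loading it for evaluation (the file path is an assumption):

```python
import json

def load_ground_truth(path):
    """Load question/truth pairs from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Each entry has "question" and "truth" keys, e.g.:
# pairs = load_ground_truth("evals/ground_truth.jsonl")  # path assumed
```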
```
dotenv-azd==0.2.0
azure-ai-evaluation==1.0.1
rich
git+https://github.com/Azure-Samples/ai-rag-chat-evaluator
```
TODO: Add a version/branch
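For that TODO, pip supports pinning a git requirement to a tag, branch, or commit with an `@` suffix (the ref below is a placeholder, not a real tag in that repo):

```
git+https://github.com/Azure-Samples/ai-rag-chat-evaluator@<tag-or-commit>
```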
```python
simulations = generate_config["simulations"]
num_per_task = generate_config.get("num_per_task", 2)

asyncio.run(generate_ground_truth(azure_credential, simulations, num_per_task))
```
TODO: gist
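The `generate_config` read by the snippet above might look like this; only `simulations` and `num_per_task` appear in the code, so the shape of each simulation entry here is an assumption:

```json
{
  "simulations": [
    {"task": "Ask about employee benefits", "persona": "new employee"}
  ],
  "num_per_task": 2
}
```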
Purpose
This PR makes it easier to run evaluations by bringing the evaluation SDK and tools directly into the repo. These scripts still use the ai-rag-chat-evaluator for its custom evaluation metrics and evaluation review CLI tools, but I've moved the ground truth generation directly into the repo, as I've found that it is often very specific to the needs of the repo.
This PR currently uses the new Simulator from azure-ai-evaluation, but I might try RAGAS as well. It's hard to get good ground truth data that sufficiently covers the whole knowledge base.
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- I ran `python -m pytest` (all tests pass).
- I ran `python -m pytest --cov` to verify 100% coverage of added lines.
- I ran `python -m mypy` to check for type errors.
- I ran `ruff` and `black` manually on my code.