Traces in evaluation result are only saving the last prompt per trace name #1871
Hey @VladTabakovCortea, thanks for reporting this; it does look like a bug to me. @jjmachan can you check where this is coming from?
This is because the `prompt_traces` dictionary uses `prompt_trace.name` as the key, which is not unique when the same prompt is called more than once.
As a quick solution, I would suggest appending the last 4 characters of the `run_id` to make the key unique; a sketch of the idea is shown below.
Please provide your feedback on this solution.
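A minimal sketch of the collision and the suggested suffix fix (hypothetical names and data; this is not the actual ragas code):

```python
import uuid

# Keying by prompt name alone: the second decompose_claims call overwrites the first.
prompt_traces = {}
for run_id in [uuid.uuid4(), uuid.uuid4()]:
    prompt_traces["decompose_claims"] = f"trace for run {run_id}"
print(len(prompt_traces))  # 1 -- only the last call survives

# Suggested fix: suffix the key with the last 4 characters of the run_id.
prompt_traces = {}
for run_id in [uuid.uuid4(), uuid.uuid4()]:
    prompt_traces[f"decompose_claims_{str(run_id)[-4:]}"] = f"trace for run {run_id}"
print(len(prompt_traces))  # 2 -- both calls are kept
```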
thanks to @Vidit-Ostwal for putting up a PR to fix this ❤ @VladTabakovCortea we would love your opinion on a design decision we were considering at #1880 if you get a chance
Hey, I'm leaning towards an int, since cutting a run_id down to just 4 characters might still lead to a collision. An int would make it obvious which prompt was called first and would resolve any collision issues too, but I don't know if it can be implemented as painlessly as just adding the int at the end; just think about it, I don't have a lot of context on the rest of the usages.
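For comparison, a sketch of that incrementing-int idea (hypothetical helper, not ragas API); a per-name counter keeps keys unique and makes call order explicit:

```python
from collections import defaultdict

def unique_trace_keys(names: list[str]) -> list[str]:
    # Count occurrences of each name so repeated prompts get _0, _1, ... suffixes.
    counts: dict[str, int] = defaultdict(int)
    keys = []
    for name in names:
        keys.append(f"{name}_{counts[name]}")
        counts[name] += 1
    return keys

print(unique_trace_keys(["decompose_claims", "decompose_claims", "verify_claims"]))
# ['decompose_claims_0', 'decompose_claims_1', 'verify_claims_0']
```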
@VladTabakovCortea, looking at it from the user's perspective, I think what you are suggesting makes sense; it will make the flow easier to understand. @jjmachan can provide more insight on whether he is aligned with this solution. If he gives a go-ahead, I will update the PR to include this modification.
thanks @VladTabakovCortea 🙌🏽 @Vidit-Ostwal let's do that then - I'll also experiment with this from my end today. Thanks a lot for helping 🙂
[x] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
Evaluation result's `traces` property only keeps one trace per `trace.name`, which is inconsistent with `ragas_traces`. In `FactualCorrectness`, for example, there are multiple calls to `decompose_claims`.

Ragas version: 0.2.11
Python version: 3.11
Code to Reproduce
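The original snippet did not survive here; below is a minimal reproduction sketch, assuming the ragas 0.2.x `evaluate`/`EvaluationDataset` API and an evaluator LLM configured for the metric:

```python
from ragas import evaluate, EvaluationDataset
from ragas.metrics import FactualCorrectness

# Hypothetical single-row dataset; FactualCorrectness compares response vs. reference.
dataset = EvaluationDataset.from_list([
    {
        "response": "The Eiffel Tower is in Paris. It was built in 1889.",
        "reference": "The Eiffel Tower is located in Paris and opened in 1889.",
    }
])

# Pass llm=... as required by your setup; omitted here for brevity.
score = evaluate(dataset, metrics=[FactualCorrectness()])

print(score.traces)        # only one decompose_claims entry per trace name
print(score.ragas_traces)  # the raw runs show decompose_claims called twice
```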
As you can see in the example, the traces for that metric include only one prompt per prompt name instead of 2, as it should be, since in the code we call these metrics twice; this can be verified by checking `score.ragas_traces`.
Error trace
Expected behavior
`result.traces` contains ALL traces per prompt, including metrics with the same name.
Additional context
I think the problem is in `ragas/callbacks.py::parse_run_traces()` at line 158: it assigns by metric name, so if there are multiple metrics with the same name in the same call, it will only save the last trace.
I couldn't find anything in the open issues that correlates with mine, so I thought this was a new one; it does seem like unexpected behaviour, please let me know otherwise.