Traces in evaluation result are only saving the last prompt per trace name #1871
Hey @VladTabakovCortea, thanks for reporting this; it does look like a bug to me. @jjmachan can you check where this is coming from?
This is because the `prompt_traces` dictionary uses `prompt_trace.name` as the key, which is not unique when the same prompt is called more than once.
As a quick solution, I would suggest appending the last 4 characters of the `run_id` to make the key unique; a sketch of the idea is shown below.
Please provide your feedback on this solution.
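A minimal sketch of the collision and the suggested suffix fix (hypothetical names and data; this is not the actual ragas code):

```python
import uuid

# Keying by prompt name alone: the second decompose_claims call overwrites the first.
prompt_traces = {}
for run_id in [uuid.uuid4(), uuid.uuid4()]:
    prompt_traces["decompose_claims"] = f"trace for run {run_id}"
print(len(prompt_traces))  # 1 -- only the last call survives

# Suggested fix: suffix the key with the last 4 characters of the run_id.
prompt_traces = {}
for run_id in [uuid.uuid4(), uuid.uuid4()]:
    prompt_traces[f"decompose_claims_{str(run_id)[-4:]}"] = f"trace for run {run_id}"
print(len(prompt_traces))  # 2 -- both calls are kept
```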
thanks to @Vidit-Ostwal for putting up a PR to fix this ❤ @VladTabakovCortea we would love your opinion on a design decision we were considering at #1880 if you get a chance
Hey, I'm leaning towards an int, since cutting a run_id down to just 4 characters might still lead to a collision. An int would make it obvious which prompt was called first and would resolve any collision issues too, but I don't know if it can be implemented as painlessly as just adding the int at the end; just think about it, I don't have a lot of context on the rest of the usages.
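For comparison, a sketch of that incrementing-int idea (hypothetical helper, not ragas API); a per-name counter keeps keys unique and makes call order explicit:

```python
from collections import defaultdict

def unique_trace_keys(names: list[str]) -> list[str]:
    # Count occurrences of each name so repeated prompts get _0, _1, ... suffixes.
    counts: dict[str, int] = defaultdict(int)
    keys = []
    for name in names:
        keys.append(f"{name}_{counts[name]}")
        counts[name] += 1
    return keys

print(unique_trace_keys(["decompose_claims", "decompose_claims", "verify_claims"]))
# ['decompose_claims_0', 'decompose_claims_1', 'verify_claims_0']
```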
@VladTabakovCortea, looking at it from the user's perspective, I think what you are suggesting makes sense; it will make the flow easier to understand. @jjmachan can provide more insight on whether he is aligned with this solution. If he gives a go-ahead, I will update the PR to include this modification.
thanks @VladTabakovCortea 🙌🏽 @Vidit-Ostwal let's do that then - I'll also experiment with this from my end today. Thanks a lot for helping 🙂
[x] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
Evaluation result's `traces` property only keeps one trace per `trace.name`, which is inconsistent with `ragas_traces`. In `FactualCorrectness`, for example, there are multiple calls to `decompose_claims`.

Ragas version: 0.2.11
Python version: 3.11
Code to Reproduce
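The original snippet did not survive here; below is a minimal reproduction sketch, assuming the ragas 0.2.x `evaluate`/`EvaluationDataset` API and an evaluator LLM configured for the metric:

```python
from ragas import evaluate, EvaluationDataset
from ragas.metrics import FactualCorrectness

# Hypothetical single-row dataset; FactualCorrectness compares response vs. reference.
dataset = EvaluationDataset.from_list([
    {
        "response": "The Eiffel Tower is in Paris. It was built in 1889.",
        "reference": "The Eiffel Tower is located in Paris and opened in 1889.",
    }
])

# Pass llm=... as required by your setup; omitted here for brevity.
score = evaluate(dataset, metrics=[FactualCorrectness()])

print(score.traces)        # only one decompose_claims entry per trace name
print(score.ragas_traces)  # the raw runs show decompose_claims called twice
```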
As you can see in the example, the traces for that metric include only one prompt per prompt name instead of 2, as it should be, since in the code we call these metrics twice; this can be verified by checking `score.ragas_traces`.
Error trace
Expected behavior
`result.traces` contains ALL traces per prompt, including metrics with the same name.
Additional context
I think the problem is in `ragas/callbacks.py::parse_run_traces()` at line 158: it assigns by metric name, so if there are multiple metrics with the same name in the same call, it will only save the last trace.
I couldn't find anything in the open issues that correlates with mine, so I thought this was a new one; it does seem like unexpected behaviour, please let me know otherwise.