I've been reviewing the benchmark evaluation code and noticed that Hindsight appears to use dataset-specific prompt optimization that other compared frameworks do not receive.
This raises a serious concern about the fairness and validity of the reported benchmark results.
## Evidence
In `hindsight.py`, there is a dataset-specific `retain_mission` prompt engineered specifically for the BEAM dataset:
```python
# hindsight.py - BEAM dataset-specific instructions
_BEAM_RETAIN_MISSION = (
    "Extract ALL factual claims the user makes about themselves... "
    "including NEGATIVE statements (e.g. 'I have never done X')... "
    "Also preserve contradictions..."
)
```
This prompt is carefully crafted to capture the exact patterns that BEAM evaluates — negative self-statements, contradictions, and exhaustive factual claims.
Meanwhile, the competing frameworks (Mem0, Cognee, Mastra, etc.) are evaluated using their generic, out-of-the-box configurations with no equivalent dataset-aware tuning.
## Why This Is Problematic
- **Unequal comparison.** Giving one framework dataset-specific instructions while leaving others on default settings is not a fair evaluation. It measures "how well can you tune prompts for a specific test" rather than "how well does the framework perform."
- **Overfitting to the benchmark.** Tailoring the retain mission to capture negative statements and contradictions — patterns specifically scored by BEAM — is functionally equivalent to overfitting to a test set. The resulting scores do not reflect real-world generalization.
- **Misleading to users.** Anyone relying on these benchmarks to make an adoption decision is being presented with an apples-to-oranges comparison.
## Questions
- Is there a justification for why Hindsight receives dataset-specific prompt tuning while other frameworks do not?
- Would you be willing to publish results where all frameworks either (a) use their default configs, or (b) are each given equivalent dataset-specific tuning?
- Are there other datasets in the benchmark suite that have similar framework-specific optimizations?
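To make option (b) concrete, here is a minimal sketch of what a symmetric harness could look like: every framework receives the *same* retain mission, whether that is the default (none) or one shared dataset-aware string. All names here (`StubFramework`, `run_benchmark`) are hypothetical illustrations, not code from the actual benchmark repo.

```python
# Hypothetical sketch of a symmetric evaluation harness (illustrative only):
# either every framework gets the identical dataset-aware mission, or none does.

class StubFramework:
    """Stand-in for a memory framework under evaluation."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, dataset, retain_mission=None):
        # A real framework would ingest the dataset and answer queries;
        # here we only record whether tuning was applied, to show symmetry.
        return {"framework": self.name, "tuned": retain_mission is not None}

def run_benchmark(frameworks, dataset, mission=None):
    """Run every framework under identical conditions."""
    return {name: fw.evaluate(dataset, retain_mission=mission)
            for name, fw in frameworks.items()}

frameworks = {n: StubFramework(n) for n in ("hindsight", "mem0", "cognee")}
# (a) all frameworks on defaults:
default_run = run_benchmark(frameworks, "BEAM")
# (b) all frameworks given the same dataset-aware mission:
tuned_run = run_benchmark(frameworks, "BEAM",
                          mission="Extract ALL factual claims...")
```

The point of the sketch is simply that the tuning flag lives in the harness, not in any single framework's adapter, so no framework can receive per-dataset instructions the others lack.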
I'd appreciate transparency on this. Benchmarks are only useful when the methodology is sound.