I've been reviewing the benchmark evaluation code and noticed that Hindsight appears to use dataset-specific prompt optimization that other compared frameworks do not receive.
This raises a serious concern about the fairness and validity of the reported benchmark results.
## Evidence
In `hindsight.py`, there is a dataset-specific `retain_mission` prompt engineered specifically for the BEAM dataset:
```python
# hindsight.py - BEAM dataset-specific instructions
_BEAM_RETAIN_MISSION = (
    "Extract ALL factual claims the user makes about themselves... "
    "including NEGATIVE statements (e.g. 'I have never done X')... "
    "Also preserve contradictions..."
)
```
This prompt is carefully crafted to capture the exact patterns that BEAM evaluates — negative self-statements, contradictions, and exhaustive factual claims.
Meanwhile, the competing frameworks (Mem0, Cognee, Mastra, etc.) are evaluated using their generic, out-of-the-box configurations with no equivalent dataset-aware tuning.
## Why This Is Problematic
- **Unequal comparison.** Giving one framework dataset-specific instructions while leaving others on default settings is not a fair evaluation. It measures "how well can you tune prompts for a specific test" rather than "how well does the framework perform."
- **Overfitting to the benchmark.** Tailoring the retain mission to capture negative statements and contradictions — patterns specifically scored by BEAM — is functionally equivalent to overfitting to a test set. The resulting scores do not reflect real-world generalization.
- **Misleading to users.** Anyone relying on these benchmarks to make an adoption decision is being presented with an apples-to-oranges comparison.
## Questions
- Is there a justification for why Hindsight receives dataset-specific prompt tuning while other frameworks do not?
- Would you be willing to publish results where all frameworks either (a) use their default configs, or (b) are each given equivalent dataset-specific tuning?
- Are there other datasets in the benchmark suite that have similar framework-specific optimizations?
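To make option (b) concrete, here is a minimal sketch of what a symmetric harness could look like: every framework receives the *same* retain mission, whether that is the default (none) or one shared dataset-aware string. All names here (`StubFramework`, `run_benchmark`) are hypothetical illustrations, not code from the actual benchmark repo.

```python
# Hypothetical sketch of a symmetric evaluation harness (illustrative only):
# either every framework gets the identical dataset-aware mission, or none does.

class StubFramework:
    """Stand-in for a memory framework under evaluation."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, dataset, retain_mission=None):
        # A real framework would ingest the dataset and answer queries;
        # here we only record whether tuning was applied, to show symmetry.
        return {"framework": self.name, "tuned": retain_mission is not None}

def run_benchmark(frameworks, dataset, mission=None):
    """Run every framework under identical conditions."""
    return {name: fw.evaluate(dataset, retain_mission=mission)
            for name, fw in frameworks.items()}

frameworks = {n: StubFramework(n) for n in ("hindsight", "mem0", "cognee")}
# (a) all frameworks on defaults:
default_run = run_benchmark(frameworks, "BEAM")
# (b) all frameworks given the same dataset-aware mission:
tuned_run = run_benchmark(frameworks, "BEAM",
                          mission="Extract ALL factual claims...")
```

The point of the sketch is simply that the tuning flag lives in the harness, not in any single framework's adapter, so no framework can receive per-dataset instructions the others lack.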
I'd appreciate transparency on this. Benchmarks are only useful when the methodology is sound.