Benchmark Integrity Concern: Dataset-Specific Prompt Tuning #5

@MT2-0901

Description

I've been reviewing the benchmark evaluation code and noticed that Hindsight appears to use dataset-specific prompt optimization that other compared frameworks do not receive.

This raises a serious concern about the fairness and validity of the reported benchmark results.

Evidence

In hindsight.py, there is a dataset-specific retain_mission prompt engineered specifically for the BEAM dataset:

# hindsight.py - BEAM dataset-specific instructions
_BEAM_RETAIN_MISSION = (
    "Extract ALL factual claims the user makes about themselves... "
    "including NEGATIVE statements (e.g. 'I have never done X')..."
    "Also preserve contradictions..."
)

This prompt is carefully crafted to capture the exact patterns that BEAM evaluates — negative self-statements, contradictions, and exhaustive factual claims.

Meanwhile, the competing frameworks (Mem0, Cognee, Mastra, etc.) are evaluated using their generic, out-of-the-box configurations with no equivalent dataset-aware tuning.
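To make the asymmetry concrete, here is a hypothetical sketch — the function, framework names, and prompt strings are illustrative only, not taken from the actual benchmark code — showing what "only one framework gets a dataset-tuned mission" looks like in practice:

```python
# Hypothetical illustration of the asymmetry described above.
# Neither prompt string is quoted from the benchmark; they only
# contrast dataset-aware instructions with a generic default.

GENERIC_RETAIN_MISSION = "Remember the important facts the user shares."

BEAM_TUNED_RETAIN_MISSION = (
    "Extract ALL factual claims the user makes about themselves, "
    "including negative statements and contradictions."
)

def retain_instructions(framework: str, dataset: str) -> str:
    """Return the retain mission each framework receives under the
    setup being criticized: only Hindsight gets the tuned prompt."""
    if framework == "hindsight" and dataset == "beam":
        return BEAM_TUNED_RETAIN_MISSION
    return GENERIC_RETAIN_MISSION
```

Under this setup, two frameworks evaluated on the same dataset receive different instructions — which is precisely the unequal comparison at issue.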

Why This Is Problematic

  1. Unequal comparison. Giving one framework dataset-specific instructions while leaving others on default settings is not a fair evaluation. It measures "how well can you tune prompts for a specific test" rather than "how well does the framework perform."

  2. Overfitting to the benchmark. Tailoring the retain mission to capture negative statements and contradictions — patterns specifically scored by BEAM — is functionally equivalent to overfitting to a test set. The resulting scores do not reflect real-world generalization.

  3. Misleading to users. Anyone relying on these benchmarks to make an adoption decision is being presented with an apples-to-oranges comparison.

Questions

  • Is there a justification for why Hindsight receives dataset-specific prompt tuning while other frameworks do not?
  • Would you be willing to publish results where all frameworks either (a) use their default configs, or (b) are each given equivalent dataset-specific tuning?
  • Are there other datasets in the benchmark suite that have similar framework-specific optimizations?

I'd appreciate transparency on this. Benchmarks are only useful when the methodology is sound.
