A Post-Human Level (PHL) benchmark for testing superintelligent AI systems through dice prediction.
As AI systems increasingly match or exceed human performance on traditional benchmarks, we need new ways to measure capabilities beyond human limits. Just as humans can predict vehicle trajectories—a task impossible for simpler animals—advanced AI systems should demonstrate superior prediction of complex physical systems like dice rolls, even when humans cannot. This creates an opportunity to measure intelligence at levels far above human capability, rather than using human-level performance as a ceiling.
DiceBench is a novel benchmark designed to evaluate AI systems beyond human-level performance. It consists of a private evaluation set of 100 videos and a public dataset of 10 videos, each showing a die being rolled and cutting approximately one second before it comes to rest.
The public dataset contains 10 videos, available in the /public/dataset directory:
- FEA.webm, FEB.webm, FEC.webm - Expected outcome: 5
- FYA.webm, FYB.webm, FYC.webm - Expected outcome: 4
- SA.webm, SB.webm - Expected outcome: 6
- TA.webm, TB.webm - Expected outcome: 3
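The filename-to-outcome mapping above can be expressed directly as a Python dictionary; this is a sketch of what video_mapping.py presumably contains (the name `VIDEO_OUTCOMES` is an assumption, not taken from the actual file):

```python
# Maps each public-dataset video to the face the die lands on.
# Hypothetical sketch; the real video_mapping.py may use different names.
VIDEO_OUTCOMES = {
    "FEA.webm": 5, "FEB.webm": 5, "FEC.webm": 5,
    "FYA.webm": 4, "FYB.webm": 4, "FYC.webm": 4,
    "SA.webm": 6, "SB.webm": 6,
    "TA.webm": 3, "TB.webm": 3,
}
```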
Each video was recorded with a Samsung Galaxy S24 camera and shows a die being rolled across one of several surface types, with the clip cutting before the final outcome is visible.
The evaluation process is implemented in the /evaluation directory:
- openai_evaluator.py - Main evaluation script for running GPT-4o on video inputs
- video_mapping.py - Maps video filenames to their expected outcomes
To run the evaluation:
- Install dependencies:
pip install openai python-dotenv opencv-python
- Set up your OpenAI API key:
export OPENAI_API_KEY='your-key-here'
- Run the evaluator:
python evaluation/openai_evaluator.py
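Scoring amounts to comparing the model's predictions against the expected outcomes. A hypothetical helper (the real script's internals are not shown in this README) might look like:

```python
# Hypothetical scoring helper; openai_evaluator.py may differ.
def accuracy(predictions, expected):
    """Fraction of videos whose predicted face matches the ground truth."""
    if not predictions:
        return 0.0
    hits = sum(1 for name, face in predictions.items()
               if expected.get(name) == face)
    return hits / len(predictions)


# Example with two of the public videos:
print(accuracy({"SA.webm": 6, "TA.webm": 1},
               {"SA.webm": 6, "TA.webm": 3}))  # → 0.5
```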
| System | Accuracy |
| --- | --- |
| Random Baseline | 16.67% |
| Human Performance | 27% |
| GPT-4o | 33% |
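The random baseline follows directly from a fair six-sided die:

```python
# Guessing uniformly among six faces succeeds 1/6 of the time.
random_baseline = 1 / 6
print(f"{random_baseline:.2%}")  # → 16.67%
```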
The private evaluation set is kept secure to maintain benchmark integrity. Researchers interested in evaluating their models can contact us at [email protected] for access.
@misc{dicebench2024,
title = {DiceBench: A Post-Human Level Benchmark},
author = {Lindahl, Rasmus},
year = {2024},
publisher = {becose},
url = {https://dicebench.vercel.app},
note = {AI consultancy specializing in advanced machine learning solutions}
}
This project is licensed under the MIT License - see the LICENSE file for details.