
DiceBench 🎲

A Post-Human Level (PHL) benchmark for testing superintelligent AI systems through dice prediction.

Motivation

As AI systems increasingly match or exceed human performance on traditional benchmarks, we need new ways to measure capabilities beyond human limits. Just as humans can predict vehicle trajectories, a task impossible for simpler animals, advanced AI systems should be able to predict the outcomes of complex physical systems such as dice rolls even where humans cannot. This creates an opportunity to measure intelligence at levels far above human capability, rather than using human-level performance as a ceiling.

Overview

DiceBench is a novel benchmark designed to evaluate AI systems beyond human-level performance. It consists of a private evaluation set of 100 videos and a public dataset of 10 videos, each showing a die being rolled, with the footage cut approximately one second before the die comes to rest.


Public Dataset

The public dataset contains 10 videos available in the /public/dataset directory:

FEA.webm, FEB.webm, FEC.webm  - Expected outcome: 5
FYA.webm, FYB.webm, FYC.webm  - Expected outcome: 4
SA.webm, SB.webm              - Expected outcome: 6
TA.webm, TB.webm              - Expected outcome: 3

Each video was recorded with a Galaxy S24 camera and shows a die being rolled on one of several surface types, with the clip cut before the final outcome is visible.
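
As a quick sanity check, a public clip can be opened with OpenCV (one of the dependencies listed below). A minimal sketch, assuming the repository root as the working directory:

    # Open one public clip and report its frame rate and length.
    # Frame counts for WebM containers can be approximate.
    import cv2

    cap = cv2.VideoCapture("public/dataset/FEA.webm")
    if not cap.isOpened():
        raise RuntimeError("Could not open video; check the path to /public/dataset")

    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    print(f"{fps:.1f} fps, {frames} frames, ~{frames / fps:.2f}s")
    cap.release()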

Evaluation

The evaluation process is implemented in the /evaluation directory:

  • openai_evaluator.py - Main evaluation script for running GPT-4o on video inputs
  • video_mapping.py - Maps video filenames to their expected outcomes
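
For orientation, the public filenames and outcomes listed above imply a mapping along these lines; this is only a sketch, and the actual structure of video_mapping.py may differ:

    # Hypothetical sketch of a filename-to-outcome mapping for the public set;
    # the real video_mapping.py may use a different structure or also cover
    # the private evaluation set.
    VIDEO_MAPPING = {
        "FEA.webm": 5, "FEB.webm": 5, "FEC.webm": 5,
        "FYA.webm": 4, "FYB.webm": 4, "FYC.webm": 4,
        "SA.webm": 6, "SB.webm": 6,
        "TA.webm": 3, "TB.webm": 3,
    }

    def expected_outcome(filename: str) -> int:
        """Return the expected die face for a given video filename."""
        return VIDEO_MAPPING[filename]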

To run the evaluation:

  1. Install dependencies:

       pip install openai python-dotenv opencv-python

  2. Set up your OpenAI API key:

       export OPENAI_API_KEY='your-key-here'

  3. Run the evaluator:

       python evaluation/openai_evaluator.py
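
The dependency list suggests the general shape of the evaluator: sample frames from a clip with OpenCV, send them to GPT-4o as images, and read back a 1-6 prediction. The sketch below follows that pattern but is not the actual openai_evaluator.py; the prompt, frame count, and model settings are assumptions:

    # Hedged sketch of an evaluation step: sample a few frames from a clip,
    # send them to GPT-4o as base64 JPEGs, and ask for a single-digit answer.
    import base64
    import cv2
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # picks up OPENAI_API_KEY from a .env file if present
    client = OpenAI()

    def sample_frames(path: str, n: int = 8) -> list[str]:
        """Return n roughly evenly spaced frames as base64 JPEG strings."""
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # approximate for WebM
        frames = []
        for i in range(n):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n))
            ok, frame = cap.read()
            if not ok:
                continue
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        cap.release()
        return frames

    def predict(path: str) -> str:
        """Ask GPT-4o which face the die will land on, given sampled frames."""
        images = [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            for b64 in sample_frames(path)
        ]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [{"type": "text",
                             "text": "These frames show a die being rolled; the clip "
                                     "ends before it settles. Answer with a single "
                                     "digit 1-6: which face will land up?"}] + images,
            }],
        )
        return response.choices[0].message.content.strip()

    print(predict("public/dataset/FEA.webm"))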

Current Results

  System              Accuracy
  Random Baseline     16.67%
  Human Performance   27%
  GPT-4o              33%

Access

The private evaluation set is kept secure to maintain benchmark integrity. Researchers interested in evaluating their models can contact us at [email protected] for access.

Citation

@misc{dicebench2024,
  title = {DiceBench: A Post-Human Level Benchmark},
  author = {Lindahl, Rasmus},
  year = {2024},
  publisher = {becose},
  url = {https://dicebench.vercel.app},
  note = {AI consultancy specializing in advanced machine learning solutions}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
