324 Scientific Law Discovery Tasks • 12 Physics Domains • Interactive Model Systems
Moving beyond memorization toward true scientific discovery in complex, interactive environments
NewtonBench is the first benchmark designed to rigorously evaluate LLMs' ability to discover scientific laws through interactive experimentation rather than static function fitting. Our benchmark resolves the fundamental trilemma between scientific relevance, scalability, and memorization resistance through metaphysical shifts: systematic alterations of canonical physical laws.
- 324 tasks across 12 physics domains (Gravitation, Coulomb's Law, Fourier's Law, etc.)
- Interactive model systems requiring active experimentation and hypothesis testing
- Two difficulty dimensions: law complexity (easy/medium/hard) × system complexity (vanilla/simple/complex)
- Code-assisted evaluation to isolate reasoning from computational constraints
- Memorization-resistant through metaphysical shifts of canonical laws
- Frontier models (GPT-5, Gemini-2.5-pro) show clear but fragile discovery capabilities
- Performance degrades precipitously with increasing system complexity and noise
- Paradoxical tool effect: Code assistance helps weaker models but hinders stronger ones
- Extreme noise sensitivity: even a 0.0001 noise level causes a 13-15% accuracy drop
NewtonBench reveals that while LLMs are beginning to develop scientific reasoning skills, robust, generalizable discovery in complex environments remains the core challenge for automated science.
- 09 Oct, 2025: The paper is released on arXiv!
- News
- Get Started
- Project Structure
- Key Components
- Running Full Experiments
- Analyzing Results
- Citation
git clone https://github.com/HKUST-KnowComp/NewtonBench.git
cd NewtonBench
conda create --name newtonbench python=3.10.18
conda activate newtonbench
pip install -r requirements.txt
- In the root of the project, make a copy of the `.env.example` file and rename it to `.env`.
- Specify the following (a quick sanity check follows this list):
  - `OPENAI_API_KEY`: Your OpenAI API key for using OpenAI models
  - `OPENROUTER_API_KEY`: Your OpenRouter API key for using models provided on OpenRouter
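As an optional sanity check, you can confirm from Python that both keys are visible in your environment. The snippet below is not part of the repository; it assumes the variables from `.env` have already been loaded, for example via `python-dotenv` or your shell:

```python
import os

# Optional, hypothetical check (not part of NewtonBench): verify the API keys
# from .env are visible to Python before running any experiments.
for key in ("OPENAI_API_KEY", "OPENROUTER_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'missing'}")
```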
You are now ready to run a quick test to ensure everything is set up correctly.
python quick_start.py
The `quick_start.py` script runs two simple experiments with the `gpt41mini` model, one in "vanilla agent" mode and one in "code-assisted agent" mode, on the "Gravitation" domain with equation difficulty "easy" and model system "vanilla equation".
```
NewtonBench/
├── .env                              # Environment variables (API keys)
├── configs/                          # Configuration files
│   └── models.txt                    # List of LLM models to evaluate
│
├── modules/                          # Physics domain modules (12 domains)
│   ├── common/                       # Shared utilities and base classes
│   │   ├── evaluation.py             # Evaluation metrics and logic
│   │   ├── physics_base.py           # Base physics system definitions
│   │   ├── prompts_base.py           # Base prompt templates
│   │   └── types.py                  # Common type definitions
│   │
│   ├── m0_gravity/                   # Newton's Law of Universal Gravitation
│   ├── m1_coulomb_force/             # Coulomb's Law
│   ├── m2_magnetic_force/            # Ampere's Force Law
│   ├── m3_fourier_law/               # Fourier's Law
│   ├── m4_snell_law/                 # Snell's Law
│   ├── m5_radioactive_decay/         # Law of Radioactive Decay
│   ├── m6_underdamped_harmonic/      # Law of Damped Harmonic Motion
│   ├── m7_malus_law/                 # Malus's Law
│   ├── m8_sound_speed/               # Law of Sound Speed in Ideal Gas
│   ├── m9_hooke_law/                 # Hooke's Law
│   ├── m10_be_distribution/          # Bose-Einstein Distribution
│   ├── m11_heat_transfer/            # Law of Heat Transfer
│   │
│   └── Each module contains:
│       ├── core.py                   # Core experiment runner
│       ├── laws.py                   # Law definitions and variations
│       ├── physics.py                # Physics simulation logic
│       ├── prompts.py                # Domain-specific prompts
│       └── m*_types.py               # Domain-specific types
│
├── utils/                            # Utility modules
│   ├── call_llm_api.py               # LLM API interface
│   ├── vanilla_agent.py              # Vanilla agent (no code execution)
│   ├── code_assisted_agent.py        # Code-assisted agent
│   ├── code_executor.py              # Code execution environment
│   ├── code_executor_base.py         # Base code executor interface
│   └── noise.py                      # Noise generation utilities
│
├── evaluation_results/               # Experimental results organized by:
│   └── {model_name}/                 #   - Model name
│       └── {module}/                 #   - Physics module
│           └── {agent_type}/         #   - Agent type (vanilla/code-assisted)
│               └── {difficulty}/     #   - Difficulty level
│                   └── {version}/    #   - Version
│
├── result_analysis/                  # Scripts for analyzing results
│   ├── summarize_results.py          # Main script to summarize results
│   ├── results_by_trial.csv          # Intermediate CSV with raw trial data
│   └── aggregated_trial_summary.csv  # Final aggregated summary
│
├── quick_start.py                    # Quick start demo script
├── run_master.py                     # Main experiment runner
├── run_experiments.py                # Batch experiment executor
├── run_all_evaluations.py            # Comprehensive evaluation script
├── requirements.txt                  # Python dependencies
└── README.md
```
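Individual trial results are written as JSON files under `evaluation_results/` following the nesting above. As a quick, hypothetical way to check how many trials have accumulated before running the analysis scripts described below:

```python
from glob import glob

# Illustrative only: count raw trial .json files under evaluation_results/,
# assuming the nested layout shown in the tree above (one .json per trial).
trial_files = glob("evaluation_results/**/*.json", recursive=True)
print(f"Found {len(trial_files)} trial result files")
```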
- Physics Modules: Each of the 12 physics domains is implemented as a separate module with its own physics simulation, law definitions, and prompts (a toy sketch of a shifted law follows this list).
- Agent Types: Two agent modes are supported:
- Vanilla Agent: LLM reasoning only, no code execution
- Code-Assisted Agent: LLM with Python code execution capabilities
- Difficulty Levels: Tasks vary across two dimensions:
- Difficulty of the target law: easy/medium/hard
- Complexity of the model systems: vanilla equation/simple system/complex system
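To make these components concrete, here is a minimal, hypothetical sketch of a metaphysically shifted law and an interactive probe of it. The function names (`shifted_gravity`, `run_experiment`), the shifted exponent, and the noise model are illustrative assumptions, not the repository's actual API (see each module's `laws.py` and `physics.py` for the real definitions):

```python
import random

# Hypothetical illustration only -- not NewtonBench's actual code.
# Canonical gravitation: F = G * m1 * m2 / r**2. A "metaphysical shift"
# systematically alters the canonical form (here, the distance exponent),
# so the target law cannot simply be recalled from memory.
G = 6.674e-11

def shifted_gravity(m1: float, m2: float, r: float, exponent: float = 2.7) -> float:
    """Shifted law the agent must rediscover (the exponent 2.7 is an arbitrary example)."""
    return G * m1 * m2 / r ** exponent

def run_experiment(m1: float, m2: float, r: float, noise_level: float = 0.0) -> float:
    """Simulated model system: the agent picks inputs and observes a (possibly noisy) output."""
    force = shifted_gravity(m1, m2, r)
    return force * (1.0 + random.gauss(0.0, noise_level))

# An agent designs experiments like these, then hypothesizes the underlying law:
for r in (1.0, 2.0, 4.0):
    print(f"r = {r}: F = {run_experiment(1.0, 1.0, r):.3e}")
```

Doubling r and observing the force fall by a factor of roughly 6.5 (that is, 2^2.7) rather than 4 is the kind of evidence an agent must gather to distinguish a shifted law from the canonical one it may have memorized.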
To replicate the more comprehensive evaluations described in the paper, the `run_master.py` script allows you to run the full benchmark across all physics modules and a variety of LLM models.
You can specify the list of LLM models to test by editing the `configs/models.txt` file. The default file includes all 11 LLMs evaluated in our paper.
Example `configs/models.txt`:
# List of models to be evaluated
gpt41
o4mini
gpt5
Remark: The model names in the `models.txt` file must match exactly those specified in `utils/call_llm_api.py`.
Once you have configured the `models.txt` file, you can run the benchmark with the following command. The `--parallel` argument specifies how many experiments to run in parallel.
python run_master.py --parallel 5
If you want to run the benchmark for a single model, you can use the `--model_name` command-line argument.
python run_master.py --model_name gpt41mini --parallel 5
The `--parallel` argument controls the number of concurrent processes. A higher number will run more experiments and open more terminals at the same time, which can be faster but will also consume more system resources.
# Run 8 experiments in parallel
python run_master.py --parallel 8
After running experiments, you can use the `result_analysis/summarize_results.py` script to process and aggregate the results into a summary CSV file.
The script performs two main functions in a single run:
- Consolidation: It finds all individual trial `.json` files in the `evaluation_results` directory and compiles them into a single raw data file: `result_analysis/results_by_trial.csv`.
- Aggregation: It then processes `results_by_trial.csv`, performs statistical analysis (including outlier detection), and generates a final summary CSV file named `aggregated_trial_summary.csv`.
To generate the summary for all models listed in `configs/models.txt`, run:
python result_analysis/summarize_results.py
You can also generate the summary for a single model by specifying its name. For example:
python result_analysis/summarize_results.py --model_name gpt41mini
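To explore the aggregated numbers interactively, one quick (hypothetical) option is to load the summary CSV with pandas; the exact column names depend on the script's output, so inspect them first:

```python
import pandas as pd

# Illustrative only: load the aggregated summary produced by summarize_results.py.
# Column names are whatever the script emits, so print them before filtering.
df = pd.read_csv("result_analysis/aggregated_trial_summary.csv")
print(df.columns.tolist())
print(df.head())
```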
If you use NewtonBench in your research, please cite our paper:
@misc{zheng2025newtonbenchbenchmarkinggeneralizablescientific,
title={NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents},
author={Tianshi Zheng and Kelvin Kiu-Wai Tam and Newt Hue-Nam K. Nguyen and Baixuan Xu and Zhaowei Wang and Jiayang Cheng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Tianqing Fang and Yangqiu Song and Ginny Y. Wong and Simon See},
year={2025},
eprint={2510.07172},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2510.07172},
}
Tianshi Zheng ([email protected])
Kelvin Kiu-Wai Tam ([email protected])
Newt Hue-Nam K. Nguyen ([email protected])