
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

🔭 Can LLMs Rediscover Newton's Laws?

324 Scientific Law Discovery Tasks • 12 Physics Domains • Interactive Model Systems

✨Moving beyond memorization toward true scientific discovery in complex, interactive environments✨


🚀 TL;DR

NewtonBench is the first benchmark designed to rigorously evaluate LLMs' ability to discover scientific laws through interactive experimentation rather than static function fitting. Our benchmark resolves the fundamental trilemma between scientific relevance, scalability, and memorization resistance through metaphysical shifts: systematic alterations of canonical physical laws.

🎯 Key Features

  • 324 tasks across 12 physics domains (Gravitation, Coulomb's Law, Fourier's Law, etc.)
  • Interactive model systems requiring active experimentation and hypothesis testing
  • Two difficulty dimensions: law complexity (easy/medium/hard) × system complexity (vanilla/simple/complex)
  • Code-assisted evaluation to isolate reasoning from computational constraints
  • Memorization-resistant through metaphysical shifts of canonical laws
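
For intuition, a "metaphysical shift" replaces a canonical law with a structurally altered counterpart, so an agent must rediscover the governing equation from experiments rather than recall it. The snippet below is a purely illustrative sketch in Python; the shifted exponent is hypothetical and is not a law actually used in the benchmark.

# Illustrative only: the shifted exponent below is hypothetical, not taken from NewtonBench.
G = 6.674e-11  # gravitational constant

def canonical_gravity(m1: float, m2: float, r: float) -> float:
    """Newton's law of universal gravitation: F = G * m1 * m2 / r**2."""
    return G * m1 * m2 / r**2

def shifted_gravity(m1: float, m2: float, r: float) -> float:
    """A hypothetical 'metaphysically shifted' variant with an altered distance exponent.
    An agent only observes inputs and outputs and must infer this functional form."""
    return G * m1 * m2 / r**2.5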

🔬 What We Discovered

  • Frontier models (GPT-5, Gemini-2.5-pro) show clear but fragile discovery capabilities
  • Performance degrades precipitously with increasing system complexity and noise
  • Paradoxical tool effect: Code assistance helps weaker models but hinders stronger ones
  • Extreme noise sensitivity: Even a 0.0001 noise level causes a 13-15% accuracy drop

πŸ† Why It Matters

NewtonBench reveals that while LLMs are beginning to develop scientific reasoning skills, robust, generalizable discovery in complex environments remains the core challenge for automated science.


Framework figure: Quick overview of NewtonBench.

🔥 News

  • 09 Oct, 2025: The paper is released on arXiv!

🚀 Get Started

1. Clone the Repository

git clone https://github.com/HKUST-KnowComp/NewtonBench.git
cd NewtonBench

2. Create and Activate a Conda Environment

conda create --name newtonbench python=3.10.18
conda activate newtonbench

3. Install Dependencies

pip install -r requirements.txt

4. Set Up API Keys

  1. In the root of the project, make a copy of the .env.example file and rename the copy to .env.
  2. Specify the following:
    • OPENAI_API_KEY: Your OpenAI API key for using OpenAI models
    • OPENROUTER_API_KEY: Your OpenRouter API key for using models provided in OpenRouter
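
A minimal .env might look like the following (the values are placeholders, not real keys):

OPENAI_API_KEY=your-openai-key-here
OPENROUTER_API_KEY=your-openrouter-key-here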

5. Run the Quick Start

You are now ready to run a quick test to ensure everything is set up correctly.

python quick_start.py

The quick_start.py script runs two simple experiments with the gpt41mini model, one in "vanilla agent" mode and one in "code-assisted agent" mode, on the "Gravitation" domain with equation difficulty "easy" and the "vanilla equation" model system.
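
If the quick start fails with authentication errors, a short check like the one below can confirm that your keys are being read from .env. This is a hypothetical helper, not part of the repository, and it assumes the python-dotenv package is available in your environment.

# check_env.py -- hypothetical helper, not included in the repository
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # load variables from .env in the project root
for key in ("OPENAI_API_KEY", "OPENROUTER_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")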

πŸ—οΈ Project Structure

NewtonBench/
├── .env                          # environment variables (API keys)
├── configs/                      # Configuration files
│   └── models.txt                # List of LLM models to evaluate
│
├── modules/                      # Physics domain modules (12 domains)
│   ├── common/                   # Shared utilities and base classes
│   │   ├── evaluation.py         # Evaluation metrics and logic
│   │   ├── physics_base.py       # Base physics system definitions
│   │   ├── prompts_base.py       # Base prompt templates
│   │   └── types.py              # Common type definitions
│   │
│   ├── m0_gravity/               # Newton’s Law of Universal Gravitation
│   ├── m1_coulomb_force/         # Coulomb’s Law
│   ├── m2_magnetic_force/        # Ampere’s Force Law
│   ├── m3_fourier_law/           # Fourier’s Law
│   ├── m4_snell_law/             # Snell’s Law
│   ├── m5_radioactive_decay/     # Law of Radioactive Decay
│   ├── m6_underdamped_harmonic/  # Law of Damped Harmonic Motion
│   ├── m7_malus_law/             # Malus’s Law
│   ├── m8_sound_speed/           # Law of Sound Speed in Ideal Gas
│   ├── m9_hooke_law/             # Hooke’s Law
│   ├── m10_be_distribution/      # Bose-Einstein Distribution
│   ├── m11_heat_transfer/        # Law of Heat Transfer
│   │
│   └── Each module contains:
│       ├── core.py               # Core experiment runner
│       ├── laws.py               # Law definitions and variations
│       ├── physics.py            # Physics simulation logic
│       ├── prompts.py            # Domain-specific prompts
│       └── m*_types.py           # Domain-specific types
│
├── utils/                        # Utility modules
│   ├── call_llm_api.py           # LLM API interface
│   ├── vanilla_agent.py          # Vanilla agent (no code execution)
│   ├── code_assisted_agent.py    # Code-assisted agent
│   ├── code_executor.py          # Code execution environment
│   ├── code_executor_base.py     # Base code executor interface
│   └── noise.py                  # Noise generation utilities
│
├── evaluation_results/           # Experimental results organized by:
│   └── {model_name}/             # - Model name
│       └── {module}/             # - Physics module
│           └── {agent_type}/     # - Agent type (vanilla/code-assisted)
│               └── {difficulty}/ # - Difficulty level
│                   └── {version}/  # - Version
│
├── result_analysis/              # Scripts for analyzing results
│   ├── summarize_results.py      # Main script to summarize results
│   ├── results_by_trial.csv      # Intermediate CSV with raw trial data
│   └── aggregated_trial_summary.csv    # Final aggregated summary
│
├── quick_start.py                # Quick start demo script
├── run_master.py                 # Main experiment runner
├── run_experiments.py            # Batch experiment executor
├── run_all_evaluations.py        # Comprehensive evaluation script
├── requirements.txt              # Python dependencies
└── README.md
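
As a conceptual illustration only (the actual interfaces live in modules/common/ and each module's core.py, laws.py, and physics.py, and may differ), a domain module pairs a hidden law variant with a simulator that the agent can query during an experiment:

# Conceptual sketch -- the names and signatures here are hypothetical, not the repository's API.
import random
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LawVariant:
    name: str                       # e.g. a shifted gravitation variant
    difficulty: str                 # "easy" | "medium" | "hard"
    compute: Callable[..., float]   # the hidden (shifted) law

def run_measurement(law: LawVariant, inputs: Dict[str, float], noise: float = 0.0) -> float:
    """Return one (possibly noisy) observation for the agent's chosen inputs."""
    value = law.compute(**inputs)
    return value * (1.0 + random.gauss(0.0, noise)) if noise > 0 else value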

🔬 Key Components

  • Physics Modules: Each of the 12 physics domains is implemented as a separate module with its own physics simulation, law definitions, and prompts.
  • Agent Types: Two agent modes are supported:
    • Vanilla Agent: LLM reasoning only, no code execution
    • Code-Assisted Agent: LLM with Python code execution capabilities
  • Difficulty Levels: Tasks vary across two dimensions:
    • Difficulty of the target law: easy/medium/hard
    • Complexity of the model systems: vanilla equation/simple system/complex system
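
Together, the two dimensions span 3 × 3 = 9 possible configurations, which can be enumerated as in this small sketch (labels taken from the list above):

from itertools import product

law_difficulties = ["easy", "medium", "hard"]
system_complexities = ["vanilla equation", "simple system", "complex system"]

# 3 x 3 = 9 (law difficulty, system complexity) configurations
for difficulty, system in product(law_difficulties, system_complexities):
    print(f"{difficulty} law in a {system}")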

🧪 Running Full Experiments

To replicate the more comprehensive evaluations described in the paper, use the run_master.py script to run the full benchmark across all physics modules and a variety of LLM models.

Method 1: Using models.txt

You can specify the list of LLM models to test by editing the configs/models.txt file. The default file includes all 11 LLMs evaluated in our paper.

Example configs/models.txt:

# List of models to be evaluated
gpt41
o4mini
gpt5

Remark: The model names in the models.txt file must exactly match those specified in utils/call_llm_api.py.
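
Before launching a long run, a quick way to catch mismatched names is a plain text check like the one below (an illustrative helper, not part of the repository; it simply searches the source file for each listed name):

# check_models.py -- illustrative helper, not included in the repository
from pathlib import Path

api_source = Path("utils/call_llm_api.py").read_text()
for line in Path("configs/models.txt").read_text().splitlines():
    name = line.strip()
    if not name or name.startswith("#"):
        continue  # skip blank lines and comments
    if name not in api_source:
        print(f"WARNING: '{name}' does not appear in utils/call_llm_api.py")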

Once you have configured the models.txt file, you can run the benchmark with the following command. The --parallel argument specifies how many experiments to run in parallel.

python run_master.py --parallel 5

Method 2: Specifying a Single Model

If you want to run the benchmark for a single model, you can use the --model_name command-line argument.

python run_master.py --model_name gpt41mini --parallel 5

Controlling Parallelism

The --parallel argument controls the number of concurrent processes. A higher number will run more experiments and open more terminals at the same time, which can be faster but will also consume more system resources.

# Run 8 experiments in parallel
python run_master.py --parallel 8

📈 Analyzing Results

After running experiments, you can use the result_analysis/summarize_results.py script to process and aggregate the results into a summary CSV file.

The script performs two main functions in a single run:

  1. Consolidation: It finds all individual trial .json files in the evaluation_results directory and compiles them into a single raw data file: result_analysis/results_by_trial.csv.
  2. Aggregation: It then processes results_by_trial.csv, performs statistical analysis (including outlier detection), and generates a final summary CSV file named aggregated_trial_summary.csv.

To generate the summary for all models listed in configs/models.txt, run:

python result_analysis/summarize_results.py

You can also generate the summary for a single model by specifying its name. For example:

python result_analysis/summarize_results.py --model_name gpt41mini
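
To inspect the resulting summary programmatically, you can load the CSV with pandas (no specific column names are assumed here; check the file's header for the actual ones):

import pandas as pd

df = pd.read_csv("result_analysis/aggregated_trial_summary.csv")
print(df.columns.tolist())  # list the actual column names
print(df.head())            # peek at the first few aggregated rows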

🌟 Citation

If you use NewtonBench in your research, please cite our paper:

@misc{zheng2025newtonbenchbenchmarkinggeneralizablescientific,
      title={NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents}, 
      author={Tianshi Zheng and Kelvin Kiu-Wai Tam and Newt Hue-Nam K. Nguyen and Baixuan Xu and Zhaowei Wang and Jiayang Cheng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Tianqing Fang and Yangqiu Song and Ginny Y. Wong and Simon See},
      year={2025},
      eprint={2510.07172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.07172}, 
}

Contacts

Tianshi Zheng ([email protected])

Kelvin Kiu-Wai Tam ([email protected])

Newt Hue-Nam K. Nguyen ([email protected])
