OpenSeeSimE is a multimodal benchmark for evaluating the capabilities of vision-language models (VLMs) in understanding and reasoning about engineering simulation outputs. This benchmark addresses the growing need for evaluating AI models on domain-specific technical tasks beyond general-purpose visual understanding.
The OpenSeeSimE benchmark comprises multimodal data from 12 distinct simulation examples (6 structural, 6 fluid dynamics) generated using standard engineering simulation software (Ansys). Each simulation is represented in three modalities:
- Images: Static visualizations of simulation results (.png format)
- Videos: Animated visualizations of simulation behavior (.gif for structural, .mp4 for fluid)
- Text: Detailed simulation reports and parameters (.txt format)
The 12 simulation cases are:

Structural (File_1–File_6):
- File_1: Bicycle crank under load
- File_2: Pretension bolt assembly
- File_3: Thermal stress expansion case
- File_4: Point mass system (drone)
- File_5: Valve housing under pressure
- File_6: Brittle material under bending and torsion

Fluid dynamics (File_7–File_12):
- File_7: Airfoil
- File_8: Falling-sphere viscometer
- File_9: Natural convection model
- File_10: High-speed rail aerodynamics
- File_11: Heat exchange pipe flow
- File_12: Pressure-driven cavitating flow
The benchmark defines a binary question-answering task using 40 expert-crafted true/false questions (20 for structural simulations, 20 for fluid dynamics) that test understanding of engineering principles and the ability to interpret simulation outputs.
Example questions include:
- "Are there any stress concentrations exceeding material limits?"
- "Is mass conservation satisfied?"
- "Is the flow regime laminar?"
These questions are designed to test fundamental engineering concepts while remaining geometry-agnostic to evaluate model generalization capabilities.
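To make the task concrete, the sketch below shows one possible way to render a question and a simulation report as a binary prompt. The template and the function name `build_prompt` are illustrative choices, not something prescribed by the benchmark.

```python
def build_prompt(question, report_text):
    """Render one benchmark question as a true/false prompt over a simulation report.

    Illustrative template only; the benchmark does not prescribe a prompt format.
    """
    return (
        "You are given the report of an engineering simulation.\n\n"
        f"{report_text}\n\n"
        f"Question: {question}\n"
        "Answer with exactly one word: true or false."
    )
```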
The benchmark is provided as a JSON file (`benchmark/qa-benchmark-with-qids.json`) with the following structure:
```jsonc
{
  "structural": {
    "description": "Questions related to structural simulation analysis",
    "questions": [
      {
        "qid": "a7e94b2c6f1d83052c9e",
        "question": "Are there any stress concentrations exceeding material limits?"
      },
      // Additional questions...
    ],
    "files": {
      "File_1": {
        "answers": {
          "a7e94b2c6f1d83052c9e": false,
          // Additional answers...
        }
      },
      // Additional files...
    }
  },
  "fluid": {
    // Similar structure for fluid dynamics questions and answers
  }
}
```
The following Python snippet loads the benchmark and the individual data files:

```python
import json

def load_benchmark(benchmark_path="benchmark/qa-benchmark-with-qids.json"):
    """Load the SeeSim benchmark questions and answers."""
    with open(benchmark_path, 'r') as f:
        benchmark_data = json.load(f)
    return benchmark_data

# Example usage
benchmark = load_benchmark()

# Access questions for a specific domain
structural_questions = benchmark["structural"]["questions"]
fluid_questions = benchmark["fluid"]["questions"]

# Access answers for a specific file
file1_answers = benchmark["structural"]["files"]["File_1"]["answers"]

# Example of locating an image file
def load_image(domain, file_id):
    """Return the path to the image for a specific domain and file ID."""
    image_path = f"data/{domain}/images/{file_id}.png"
    # Load using your preferred image library (e.g., PIL, OpenCV)
    return image_path

# Example of locating a video file
def load_video(domain, file_id):
    """Return the path to the video for a specific domain and file ID."""
    ext = "gif" if domain == "structural" else "mp4"
    video_path = f"data/{domain}/videos/{file_id}.{ext}"
    # Load using your preferred video library (e.g., imageio, OpenCV)
    return video_path

# Example of loading a text report
def load_text(domain, file_id):
    """Load the simulation report text for a specific domain and file ID."""
    text_path = f"data/{domain}/text/{file_id}.txt"
    with open(text_path, 'r') as f:
        text_content = f.read()
    return text_content
```
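The pieces above can be combined into a simple per-file evaluation loop. The sketch below is a minimal example for the text modality; it reuses the `build_prompt` sketch shown earlier and assumes a hypothetical `query_model(prompt)` callable that returns the string "true" or "false"; neither the loop nor `query_model` is defined by the benchmark itself.

```python
def evaluate_text_modality(benchmark, domain, query_model):
    """Score a model on the text modality for one domain ('structural' or 'fluid').

    Minimal sketch: `query_model` is a user-supplied callable that maps a prompt
    string to the answer string "true" or "false".
    """
    questions = benchmark[domain]["questions"]
    correct, total = 0, 0
    for file_id, entry in benchmark[domain]["files"].items():
        report = load_text(domain, file_id)  # text-modality loader defined above
        for q in questions:
            prompt = build_prompt(q["question"], report)
            prediction = query_model(prompt).strip().lower() == "true"
            correct += int(prediction == entry["answers"][q["qid"]])
            total += 1
    return correct / total
```

The same loop applies to the image and video modalities by substituting `load_image` or `load_video` and adapting the prompt to the model's multimodal input format.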
For more detailed evaluation guidelines, see `benchmark/evaluation_guidelines.md`.
The repository is organized as follows:

```
SeeSim-Benchmark/
├── README.md
├── LICENSE
├── data/
│   ├── structural/
│   │   ├── images/   # File_1.png, File_2.png, etc.
│   │   ├── videos/   # File_1.gif, File_2.gif, etc.
│   │   └── text/     # File_1.txt, File_2.txt, etc.
│   └── fluid/
│       ├── images/   # File_7.png, File_8.png, etc.
│       ├── videos/   # File_7.mp4, File_8.mp4, etc.
│       └── text/     # File_7.txt, File_8.txt, etc.
├── benchmark/
│   ├── qa-benchmark-with-qids.json
│   ├── evaluation_guidelines.md
│   ├── benchmark_loader.py
│   └── metadata.json
└── paper/
    ├── simulation_vs_hallucination.pdf
    └── figures/
```
We recommend using the following metrics for evaluation:
- Overall Accuracy: Percentage of correctly answered questions across all files
- Domain-Specific Accuracy: Separate accuracies for structural and fluid dynamics questions
- Logical Consistency: Rate of contradictions between paired opposing questions
- Validation Accuracy: Performance on questions testing fundamental physical laws
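A possible implementation of the accuracy and consistency metrics is sketched below, assuming predictions and ground-truth answers are stored as `qid -> bool` dictionaries. The list of opposing question pairs is not part of the JSON shown above, so `opposing_pairs` is a hypothetical input here; the sketch also assumes a contradiction means answering two mutually exclusive questions the same way.

```python
def accuracy(predictions, answers):
    """Fraction of correctly answered questions (both dicts map qid -> bool)."""
    shared = set(predictions) & set(answers)
    return sum(predictions[q] == answers[q] for q in shared) / len(shared)

def contradiction_rate(predictions, opposing_pairs):
    """Rate of contradictions between paired opposing questions.

    opposing_pairs: list of (qid_a, qid_b) tuples whose ground-truth answers are
    mutually exclusive, so giving both the same answer counts as a contradiction.
    """
    contradictions = sum(predictions[a] == predictions[b] for a, b in opposing_pairs)
    return contradictions / len(opposing_pairs)
```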
Our paper establishes baseline performance metrics for several state-of-the-art vision-language models across three modalities:
| Model  | Text  | Video | Image |
|--------|-------|-------|-------|
| LLaVA  | 66.3% | 53.8% | -     |
| Phi-3  | 62.5% | -     | -     |
| Qwen   | 54.6% | 52.9% | 52.9% |
| Llama  | 52.1% | -     | 54.6% |
| BLIP   | -     | 52.9% | 52.9% |
| Kosmos | -     | 52.9% | 52.9% |
Key findings:
- Text modality consistently outperforms visual modalities
- Models perform better on structural analysis than fluid dynamics
- Performance is still far below requirements for reliable engineering applications
- Models show significant inconsistencies when answering related questions
This benchmark is released for research purposes only. The ground truth answers are included to enable reproducibility and transparent evaluation. Researchers are encouraged to use these answers only for evaluation and not for fine-tuning models specifically for this benchmark.
If you use this benchmark in your research, please cite our paper:
Citation forthcoming.
This repository is released under the MIT License; the benchmark dataset is provided for research purposes only.
We thank Ansys for providing the simulation examples used in this benchmark.
For questions or issues, please open an issue in this repository.