Performance benchmarks for coupled AI and simulation workflows on HPC systems.
SimAI-Bench is a flexible and extensible interface for emulating, benchmarking, and prototyping AI-coupled HPC workflows. With its easy-to-use Python packages and intuitive API, SimAI-Bench allows users to compose faithful mini-apps of their full end-to-end workflows with diverse simulation and AI components and multiple in situ or in transit data transfer patterns.
SimAI-Bench is composed of the following modules:

- `Kernel`: The `Kernel` module, adapted from the workflow-mini-API, provides primitives for key compute, I/O, communication, and copy operations. It leverages CuPy and dpnp for compute and copy operations on NVIDIA, AMD, and Intel GPUs, mpi4py for communication across ranks, and HDF5 for I/O operations to disk.
- `torch.nn`: SimAI-Bench leverages the `torch.nn` module in PyTorch to execute ML kernels or build ML models of various architectures.
- `ServerManager` and `DataStore`: The `ServerManager` and `DataStore` classes provide a unified interface for data staging and streaming through various approaches and software libraries, also referred to as data transport backends. SimAI-Bench currently supports data staging through Redis, DragonHPC, the parallel file system, and high-performance node-local flash storage or RAM. The `ServerManager` initializes and manages the resources of a particular backend, whereas the `DataStore` provides a unified client API that enables data to be transferred between any workflow components.
- `Simulation`: The `Simulation` class is the main component that emulates traditional HPC simulation components within AI-coupled workflows. It uses the underlying `Kernels` to emulate the computational and communication aspects, and the `DataStore` module to interact with the data staging backends. A simulation is configured with a Python dictionary or a JSON file that specifies all the necessary parameters.
- `AI`: The `AI` class models the machine learning portions of the workflow by leveraging the `torch.nn` and `torch.distributed` modules in PyTorch. Like the `Simulation` class, it can use the `DataStore` API to stage and stream data to other workflow components.
- `Workflow`: At the highest level, the `Workflow` class provides the functionality to compose a workflow from various `Simulation` and `AI` components and deploy it on an HPC system with the desired data transport strategy.
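Since a simulation is configured through a Python dictionary (or an equivalent JSON file), its settings can be kept alongside the workflow script. The sketch below illustrates the dictionary-or-JSON pattern only; every key name is hypothetical, so consult the SimAI-Bench documentation for the actual configuration schema.

```python
import json

# Hypothetical simulation configuration; all keys below are illustrative,
# not the actual SimAI-Bench schema.
sim_config = {
    "name": "sim",                  # component name
    "kernels": ["MatMulSimple2D"],  # compute kernels to emulate
    "steps": 10,                    # number of emulated iterations
}

# The same settings can equally live in a JSON file and be loaded at run time.
with open("sim_config.json", "w") as f:
    json.dump(sim_config, f, indent=2)

with open("sim_config.json") as f:
    loaded = json.load(f)  # identical to the in-memory dictionary
```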
To install SimAI-Bench on Aurora with all of its required dependencies, create a Python virtual environment, then clone and install the project locally with pip:

```shell
module load frameworks
python -m venv _simai --system-site-packages
. _simai/bin/activate
git clone https://github.com/argonne-lcf/SimAI-Bench.git
cd SimAI-Bench
pip install .
```
To configure DragonHPC's high-speed transport agent (HSTA) to use Libfabric on Aurora, execute:

```shell
dragon-config -a "ofi-runtime-lib=/opt/cray/libfabric/1.22.0/lib64"
```
This example demonstrates how to use the SimAI-Bench modules to create a simple workflow with two simulation components that share data through the file system and have a task dependency.

```python
from SimAIBench import Workflow
from SimAIBench import Simulation
from SimAIBench import ServerManager

# Initialize the server for data staging and the workflow instance
server = ServerManager("server", config=server_config)  # the default staging backend is the file system
server.start_server()
info = server.get_server_info()
w = Workflow(sys_info=sys_config)

# Add the first simulation component to the workflow
@w.component(name="sim",
             type="remote",  # remote execution, i.e., launched through mpirun/mpiexec; needed for multi-node or MPI components
             args={"info": info})
def run_sim(info=None):
    sim = Simulation(name="sim",
                     server_info=info)
    sim.add_kernel("MatMulSimple2D")
    sim.run()
    sim.stage_write("key", "value")  # write data to the staging backend

# Add the second simulation component to the workflow
@w.component(name="sim2",
             type="local",  # local execution, i.e., launched through a Python mp.Process on the local node
             args={"info": info},
             dependencies=["sim"])  # set a dependency on the first simulation component
def run_sim2(info=None):
    sim = Simulation(name="sim2",
                     server_info=info)
    sim.add_kernel("MatMulGeneral")
    value = sim.stage_read("key")  # read data from the staging backend
    sim.run()

# Execute the workflow
w.launch()

# Stop the server
server.stop_server()
```
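The decorator-plus-dependency pattern above can be understood with a small, self-contained toy. The class below is not the SimAI-Bench implementation; it only sketches how registering components via a decorator and declaring `dependencies` can yield a dependency-respecting execution order.

```python
class TinyWorkflow:
    """Toy illustration of decorator-based component registration with
    dependency ordering. This is NOT the SimAI-Bench implementation."""

    def __init__(self):
        self.components = {}  # name -> (function, dependency names)

    def component(self, name, dependencies=()):
        def register(fn):
            self.components[name] = (fn, tuple(dependencies))
            return fn
        return register

    def launch(self):
        done, order = set(), []

        def run(name):
            if name in done:
                return
            fn, deps = self.components[name]
            for dep in deps:  # run every dependency before the component itself
                run(dep)
            fn()
            done.add(name)
            order.append(name)

        for name in self.components:
            run(name)
        return order


w = TinyWorkflow()
results = []

@w.component(name="sim")
def run_sim():
    results.append("sim")

@w.component(name="sim2", dependencies=["sim"])
def run_sim2():
    results.append("sim2")

order = w.launch()  # "sim" always executes before "sim2"
```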
- Fork it (https://github.com/argonne-lcf/SimAI-Bench/fork)
- Clone it (`git clone https://github.com/username/SimAI-Bench.git`)
- Create your feature branch (`git checkout -b feature/fooBar`)
- Commit your changes (`git commit -am 'Add some fooBar'`)
- Push to the branch (`git push -u origin feature/fooBar`)
- Create a new Pull Request
Harikrishna Tummalapalli, Argonne National Lab, [email protected]
Riccardo Balin, Argonne National Lab, [email protected]
Christine Simpson, Argonne National Lab, [email protected]