A GPipe implementation in PyTorch
Updated Jul 25, 2024 - Python
An I/O benchmark for deep learning applications
Very-Low Overhead Checkpointing System
Extending DOLFINx with checkpointing functionality
Keras wrapper that autosaves what ModelCheckpoint cannot.
This Flink project consumes streams from one Azure Event Hub and produces to a different Event Hub. It also includes the config files for deploying it on Kubernetes.
Code and tutorial on integrating wandb sweeps with Slurm pre-emption
A lightweight checkpointing program written in C.
DMTCP scripts to get Python scripts working with SLURM.
A shared library to help test your code with failure-injection
A standalone Flink producer used for testing the contents of the flink-consume-produce-ek repo
A Python package for performing memory-intensive computations in parallel using chunks and checkpointing.
Robust distributed checkpointing and job management system for multi-GPU SLURM workloads
Automatic checkpointing and job resubmission system for robust LLM training on Slurm-based HPC clusters. Collaboration with @vulus98
A face recognition manager for digital albums that isolates images of a specified person.
Koo and Toueg’s checkpointing and recovery protocol