Gavel is a scheduler that takes into account the performance heterogeneity of underlying accelerators (GPUs, TPUs, etc.) when training DNN jobs. Gavel makes existing policies heterogeneity-aware and incorporates them as optimization problems.
- Introduction
- Background
- DNN Training
- Performance Optimizations
- System Overview
- Heterogeneity-Aware Policies
- Round-based Scheduling Mechanism
- Throughput Estimator
- Limitations and Non-Goals
- Scheduling Policies
- Max-Min Fairness as an Optimization Problem
- Other Policies as Optimization Problems
- Hierarchical Scheduling Policies
- Properties of Gavel's Policies
- Scheduling Mechanism
- Implementation
- Evaluation
- Experiment Setup
- End-to-End Results on Physical Cluster
- End-to-End Results in Simulation
- Scalability of Heterogeneity-Aware Policies
- Efficacy of Scheduling Mechanism
- Impact of Throughput Estimation
- Related Work and Discussion
- Conclusion
For a cluster scheudler, choosing the optimal accelerator types is difficult for three reasons:
- Performance heterogeneity: Common deep learning training workloads show heterogeneous performance on different hardware accelerators due to architectural differences. Not being aware of this result in suboptimal allocations.
- Generality across policies: Recent scheudlers like Allox and Gandiva optimize for a single scheduling objective, while in reality, clusters may take differnet (and sophisticated) scheduling policies.
- Colocation and optimization optimizations: The performance benefits of these optimizations should be considered explicitly while optimizing for global scheduling objectives, since these optimizations are more effective when deployed in a heterogeneity-aware way.
Gavel expresses scheduling policies as optimization problems and produces an allocation matrix that indicates the fraction of time each job should spend on the different accelerators between allocation recomputations (new jobs arriving, old jobs finishing, periodic recomputations, etc).
Gavel converted the following policies into optimization problems and included them in the code base:
- LAS/Max-min fairness: Least Attained Service policy, used by Tiresias
- LAS w/ weights
- Minimize makespan
- Finish time fairness: Used by Themis
- FIFO
- Shortest job first (SJF)
- Minimize cost (in public cloud instances)
- Minimize cost w/ SLOs
- Hierarchical (multi-level policy: FIFO, fairness, etc.)
With the optimal allocation computed, Gavel tries to dispatch jobs while matching the optimal allocation as close as possible by using two techniques:
- Round-based scheduling: Gavel allows users to set the round length (optimal length that allows for the effective approximation is 6 minutes). This mechanism ensures that jobs receive time on accelerator types according to the optimal allocation.
- Per-job priority score: In each round, the scheduler runs jobs in decreasing priority order. The priority for each job is computed as the target allocation divided by the number of rounds received.
The paper also covers extensive evaluations on:
- How well the heterogeneity-aware policies improve objective metrics, both in simulation and physical experiments
- How do Gavel's policies scale
- Whether Gavel can accurately estimate the throughputs of co-located jobs when using space sharing