
Batch acceleration

David Anderson edited this page Nov 13, 2025 · 10 revisions

The batch completion problem

A typical app might have jobs that take about 1 hour of CPU time on average. But the turnaround time on a particular host H may be much higher because:

  • H has a large work buffer, and other jobs must complete before this one starts
  • H computes only sporadically
  • H is slower than average

As a result, turnaround time on H could be several days, so the 'max delay' setting for the app may need to be, say, 1 week.

If there's a large batch - say 1000 jobs - some of them will get sent to hosts that never complete them. After a week these jobs time out and we resend them to other hosts. But some of those hosts may also never complete them, or may complete them only after a long turnaround time.

As a result, the 'makespan' of the batch - the time from submission to 100% completion - may be several weeks.

We'd like to reduce batch makespan using scheduling techniques; we call this 'batch acceleration'.

Proposed design

Our goal is to reduce makespan with minimal complexity. We're not concerned with performance; current projects have a few thousand hosts, not millions.

The basic idea: mark certain hosts as 'low turnaround time' (LTT). Mark the last 10% or so of jobs in each batch as 'high priority'. Use LTT hosts to run high-priority jobs.

This involves several components:

  • Scheduler: enforce the above scheduling rule.

  • batch_stats.php (new): scan batches and compute the median turnaround time (TT) of each. For each host, compute the average of its 'normalized TT' (TT/median). A host is LTT if this is < 1. For each app, make a list of hosts that have returned jobs. If there are at least 100, and 25% are LTT, the app is accelerable.

  • batch_accel.php (new): periodically scan in-progress batches, identifying those that need acceleration. Mark jobs as high-priority, and possibly create new instances.

Data

host.error_rate: 1 if average of TT/median < 1, else 0

app.n_size_classes: nonzero means accelerable

batch.expire_time: median TT of successful jobs

batch_stats.php

Runs every hour or so.

For each batch in which at least 50% of the jobs are successful, compute the median TT of the successful jobs.

For each job in such a batch, its 'TT ratio' is TT/median if the job succeeded, ~10 if not.

Set host.error_rate to the average of TT ratios of its jobs (over all batches). A host is low-turnaround if this is < 1.
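The TT ratio computation above can be sketched as follows. This is a minimal illustration in Python, not the actual batch_stats.php code: the function name, the tuple layout of a job, and the constant name are assumptions, and the failure penalty is fixed at exactly 10.

```python
from collections import defaultdict
from statistics import median

# Assumed value for the "~10" TT ratio given to jobs that did not succeed.
FAILED_TT_RATIO = 10.0

def host_tt_ratios(batches):
    """Return {host_id: average TT ratio over all qualifying batches}.

    batches: list of batches; each batch is a list of
    (host_id, turnaround_time, success) tuples (hypothetical layout).
    """
    ratios = defaultdict(list)
    for jobs in batches:
        ok_tts = [tt for _, tt, success in jobs if success]
        # Skip batches with fewer than 50% successful jobs.
        if not jobs or len(ok_tts) * 2 < len(jobs):
            continue
        med = median(ok_tts)
        if med <= 0:
            continue  # avoid dividing by a degenerate median
        for host_id, tt, success in jobs:
            ratios[host_id].append(tt / med if success else FAILED_TT_RATIO)
    return {h: sum(r) / len(r) for h, r in ratios.items()}
```

A host whose average comes out below 1 would be treated as low-turnaround; a single failed job pushes the average well above 1, which is the point of the large penalty.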

Note: the distribution of TT is generally a big clump below some level, with a bunch of outliers. We want to exclude the outliers.

For each app A, let N = # of hosts that completed a job for this app, and M = # of these hosts marked as low-turnaround. Set A.accelerable if N > 100 and M > 0.25*N.
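As a rough sketch of the accelerable test, assuming we already have the average TT ratio of each host that completed a job for the app (the function and parameter names here are hypothetical):

```python
def app_is_accelerable(host_ratios, min_hosts=100, min_ltt_fraction=0.25):
    """host_ratios: {host_id: average TT ratio} for hosts that completed
    a job for this app. Returns True if the app counts as accelerable."""
    n = len(host_ratios)                                   # N
    m = sum(1 for r in host_ratios.values() if r < 1.0)    # M: LTT hosts
    return n > min_hosts and m > n * min_ltt_fraction
```

Both comparisons are strict, matching the rule as stated: an app with exactly 100 hosts, or with exactly 25% LTT hosts, is not accelerable.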

Scheduler

Job selection:

if a job is high priority and app is accelerable
    if host is LTT
        boost job score
    else
        don't send
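The job selection rule could look something like this in Python. This is only a sketch: the real scheduler is not written in Python, the size of the boost is not specified here, and the function name and PRIORITY_BOOST constant are made up for illustration.

```python
from typing import Optional

# Hypothetical boost amount; the design does not say how large it is.
PRIORITY_BOOST = 1.0

def adjust_score(score: float, job_high_priority: bool,
                 app_accelerable: bool, host_is_ltt: bool) -> Optional[float]:
    """Return the adjusted job score, or None if the job must not be sent."""
    if job_high_priority and app_accelerable:
        if host_is_ltt:
            return score + PRIORITY_BOOST  # prefer this job on LTT hosts
        return None                        # don't send to non-LTT hosts
    return score                           # all other jobs are unaffected
```

Note that a high-priority job of a non-accelerable app is scheduled normally, since there aren't enough LTT hosts to rely on.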

batch_accel.php

Runs every hour or so.

For each in-progress batch B that's at least 90% complete:
    if app is accelerable
        compute average TT of completed jobs
        For each uncompleted job
            mark WU as high priority
            mark its unsent results as high priority
            if no unsent results,
                and in-prog results are older than average TT,
                and #results < max_total_results
                increment wu.target_nresults, trigger transition
    else
        set all incomplete WUs and unsent results to zero priority
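Here is a minimal in-memory sketch of the loop above for one batch. The Result and WU classes and their field names are stand-ins for database rows, not the real schema. It also assumes that 'in-prog results are older than average TT' means all in-progress results of the job are older.

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    state: str                 # "unsent", "in_progress", or "over"
    age: float = 0.0           # time since the result was sent
    high_priority: bool = False

@dataclass
class WU:
    completed: bool
    tt: float = 0.0            # turnaround time, if completed
    results: list = field(default_factory=list)
    target_nresults: int = 1
    max_total_results: int = 4
    high_priority: bool = False
    needs_transition: bool = False

def accelerate_batch(wus, app_accelerable):
    done = [w for w in wus if w.completed]
    # Only act on batches that are at least 90% complete.
    if not wus or len(done) * 10 < len(wus) * 9:
        return
    pending = [w for w in wus if not w.completed]
    if not app_accelerable:
        # Reset incomplete WUs and their unsent results to zero priority.
        for w in pending:
            w.high_priority = False
            for r in w.results:
                if r.state == "unsent":
                    r.high_priority = False
        return
    avg_tt = sum(w.tt for w in done) / len(done)
    for w in pending:
        w.high_priority = True
        unsent = [r for r in w.results if r.state == "unsent"]
        for r in unsent:
            r.high_priority = True
        in_prog = [r for r in w.results if r.state == "in_progress"]
        # Create a new instance only if nothing is waiting to be sent,
        # the in-progress results look overdue, and the limit allows it.
        if (not unsent
                and all(r.age > avg_tt for r in in_prog)
                and len(w.results) < w.max_total_results):
            w.target_nresults += 1
            w.needs_transition = True
```

In the real system, setting needs_transition would correspond to triggering the transitioner, which then creates the new result.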

Feeder

Ideally, we want both high and low prio jobs in shmem, so we'll have jobs for both LTT and non-LTT hosts.

I added a feeder option --batch_accel that adds a random factor to get a mix of high and low prio jobs.
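One way a random factor can produce such a mix is sketched below. This is not the actual --batch_accel code; it just illustrates the idea of ranking jobs by priority plus random jitter, and the function name, the jitter parameter, and the (job_id, priority) layout are all assumptions.

```python
import random

def pick_for_shmem(jobs, n_slots, jitter=1.0, rng=None):
    """Choose up to n_slots job IDs for the shared-memory cache.

    jobs: list of (job_id, priority) pairs.
    Each job is ranked by priority plus a random amount in [0, jitter],
    so low-priority jobs sometimes outrank high-priority ones.
    """
    rng = rng or random.Random()
    keyed = [(prio + rng.uniform(0, jitter), job_id) for job_id, prio in jobs]
    keyed.sort(reverse=True)
    return [job_id for _, job_id in keyed[:n_slots]]
```

With jitter set to 0 this reduces to strict priority order; a larger jitter relative to the priority gap gives non-LTT hosts a better chance of finding low-priority jobs in the cache.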

Notes/issues

In deciding whether a host is LTT, the above doesn't distinguish apps, app versions, or BUDA variants. It won't work as intended in situations where e.g. a host succeeds with CPU versions but fails with GPU versions.
