
Per-epoch slowdown for frequent checkpointing #23

@michaelmckinsey1

Description

There seems to be an issue where frequent checkpointing negatively impacts training time, slowing training by more than 2x over the course of roughly 100 epochs.

  • This does not happen on matrix (CUDA), or at least not noticeably. It has been observed with both rocm6 and rocm7 wheels and multiple torch versions (2.8 and 2.10).
  • The issue is identical on NFS and Lustre, so filesystem speed is not a factor.

Examples of the issue (checkpoint_interval: 1)

The per-epoch time here does not include the (synchronous) checkpointing itself.

# Expected behavior (CUDA)
Epoch 6 completed in 3.9217326641082764 seconds. Total train time so far: 171.78343605995178
...
Epoch 99 completed in 3.8156378269195557 seconds. Total train time so far: 3106.8321845531464

# Issue (ROCm)
Epoch 2 completed in 5.56635046005249 seconds. Total train time so far: 164.07908129692078
...
Epoch 46 completed in 26.325266361236572 seconds. Total train time so far: 3019.2210574150085
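For reference, the timings above can be produced by a loop like the following: a minimal sketch, assuming hypothetical train_one_epoch() and save_checkpoint() helpers (not the actual training script). The point is that the timer stops before the checkpoint write, so the reported per-epoch time excludes checkpointing, yet it still grows on ROCm.

```python
import time

def run(epochs, checkpoint_interval, train_one_epoch, save_checkpoint):
    """Time each epoch, excluding the synchronous checkpoint write."""
    total = 0.0
    for epoch in range(epochs):
        start = time.perf_counter()
        train_one_epoch()
        elapsed = time.perf_counter() - start  # checkpoint time not counted
        total += elapsed
        print(f"Epoch {epoch} completed in {elapsed} seconds. "
              f"Total train time so far: {total}")
        if (epoch + 1) % checkpoint_interval == 0:
            save_checkpoint(epoch)  # synchronous; runs outside the timer
```

With checkpoint_interval: 1 every epoch is followed by a checkpoint, which is where the per-epoch slowdown appears despite the timer excluding the save itself.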

Examples of the issue (checkpoint_interval: 10)

Not nearly as bad, but the slowdown is still significant.

Epoch 2 completed in 6.140593528747559 seconds. Total train time so far: 129.89002299308777
...
Epoch 70 completed in 10.53577995300293 seconds. Total train time so far: 943.0159168243408

Solution

Disable checkpointing (by setting checkpoint_interval > epochs).

If checkpointing is necessary, this may be mitigated with a very high checkpoint_interval; suggest checkpoint_interval > 100.

  • Check the impact of checkpoint_interval: 100
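The relationship between checkpoint_interval and the number of synchronous checkpoint writes a run performs can be sketched as follows (a hypothetical helper, not part of the training code):

```python
def effective_checkpoints(epochs, checkpoint_interval):
    """Number of synchronous checkpoints a run will perform.

    Setting checkpoint_interval > epochs makes this zero, i.e. it
    disables checkpointing entirely, which is the suggested workaround.
    """
    return epochs // checkpoint_interval

# effective_checkpoints(100, 101) -> 0   (checkpointing disabled)
# effective_checkpoints(100, 100) -> 1   (single checkpoint at the end)
# effective_checkpoints(100, 1)   -> 100 (checkpoint every epoch)
```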
