Per-epoch slowdown for frequent checkpointing #23
Description
There seems to be an issue where frequent checkpointing negatively impacts training time, slowing training by more than 2x over the course of ~100 epochs.
- This does not happen on matrix (CUDA), or at least not noticeably. It has been observed with ROCm 6 and ROCm 7 wheels and multiple torch versions (2.8 and 2.10).
- The issue is identical on NFS and Lustre, so filesystem speed is irrelevant.
Examples of the issue (checkpoint_interval: 1)
The per-epoch time here does not include the (synchronous) checkpoint save itself.
# Expected behavior (CUDA)
Epoch 6 completed in 3.9217326641082764 seconds. Total train time so far: 171.78343605995178
...
Epoch 99 completed in 3.8156378269195557 seconds. Total train time so far: 3106.8321845531464
# Issue (ROCm)
Epoch 2 completed in 5.56635046005249 seconds. Total train time so far: 164.07908129692078
...
Epoch 46 completed in 26.325266361236572 seconds. Total train time so far: 3019.2210574150085
Examples of the issue (checkpoint_interval: 10)
Not nearly as bad, but still significant.
Epoch 2 completed in 6.140593528747559 seconds. Total train time so far: 129.89002299308777
...
Epoch 70 completed in 10.53577995300293 seconds. Total train time so far: 943.0159168243408
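For reference, a minimal sketch of how per-epoch timing that excludes a synchronous checkpoint save might be produced (the issue does not show the actual training loop; `train_one_epoch` and `save_checkpoint` are hypothetical callables, and the `epoch % interval` trigger is an assumption):

```python
import time

def train(num_epochs, checkpoint_interval, train_one_epoch, save_checkpoint):
    """Time each epoch's training work only; the checkpoint save runs
    outside the timed region, so any reported slowdown is in the
    training step itself, not in the save call."""
    total = 0.0
    for epoch in range(num_epochs):
        start = time.perf_counter()
        train_one_epoch(epoch)
        elapsed = time.perf_counter() - start
        total += elapsed
        print(f"Epoch {epoch} completed in {elapsed} seconds. "
              f"Total train time so far: {total}")
        # Checkpoint after the timer has stopped (synchronous save).
        if (epoch + 1) % checkpoint_interval == 0:
            save_checkpoint(epoch)
    return total
```

With checkpoint_interval: 1 this saves after every epoch, which matches the first set of logs; the per-epoch numbers above nonetheless grow, so the slowdown is not simply the cost of the save itself.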
Solution
- Disable checkpointing (by setting checkpoint_interval > epochs).
- If checkpointing is necessary, this may be mitigated with a very high checkpoint_interval; checkpoint_interval > 100 is suggested.
- Check the impact of checkpoint_interval: 100.
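The workaround relies on the interval check never firing. A small sketch of the assumed "every N epochs" semantics (an assumption; the project's exact trigger logic is not shown in the issue):

```python
def checkpoint_epochs(num_epochs, checkpoint_interval):
    """Return the epochs at which a checkpoint would be written,
    assuming a checkpoint fires whenever epoch % checkpoint_interval == 0."""
    return [e for e in range(1, num_epochs + 1) if e % checkpoint_interval == 0]

# Setting checkpoint_interval > epochs means no epoch ever triggers a save,
# which effectively disables checkpointing:
checkpoint_epochs(100, 101)  # → []
```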