
Per-epoch slowdown for frequent checkpointing #23

@michaelmckinsey1

Description

There seems to be an issue where frequent checkpointing negatively impacts training time, slowing training by more than 2x over the course of roughly 100 epochs.

  • This does not happen on matrix (CUDA), or at least not noticeably. It has been observed with both rocm6 and rocm7 wheels and multiple torch versions (2.8 and 2.10).
  • The issue is identical on NFS and Lustre, so filesystem speed is not a factor.

Examples of the issue (checkpoint_interval: 1)

The per-epoch time here does not include the (synchronous) checkpointing itself.

# Expected behavior (CUDA)
Epoch 6 completed in 3.9217326641082764 seconds. Total train time so far: 171.78343605995178
...
Epoch 99 completed in 3.8156378269195557 seconds. Total train time so far: 3106.8321845531464

# Issue (ROCm)
Epoch 2 completed in 5.56635046005249 seconds. Total train time so far: 164.07908129692078
...
Epoch 46 completed in 26.325266361236572 seconds. Total train time so far: 3019.2210574150085
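For reference, the timings above can be produced by a loop like the following: a minimal sketch, assuming hypothetical train_one_epoch() and save_checkpoint() helpers (not the actual training script). The point is that the timer stops before the checkpoint write, so the reported per-epoch time excludes checkpointing, yet it still grows on ROCm.

```python
import time

def run(epochs, checkpoint_interval, train_one_epoch, save_checkpoint):
    """Time each epoch, excluding the synchronous checkpoint write."""
    total = 0.0
    for epoch in range(epochs):
        start = time.perf_counter()
        train_one_epoch()
        elapsed = time.perf_counter() - start  # checkpoint time not counted
        total += elapsed
        print(f"Epoch {epoch} completed in {elapsed} seconds. "
              f"Total train time so far: {total}")
        if (epoch + 1) % checkpoint_interval == 0:
            save_checkpoint(epoch)  # synchronous; runs outside the timer
```

With checkpoint_interval: 1 every epoch is followed by a checkpoint, which is where the per-epoch slowdown appears despite the timer excluding the save itself.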

Examples of the issue (checkpoint_interval: 10)

Not nearly as bad, but the slowdown is still significant.

Epoch 2 completed in 6.140593528747559 seconds. Total train time so far: 129.89002299308777
...
Epoch 70 completed in 10.53577995300293 seconds. Total train time so far: 943.0159168243408

Solution

Disable checkpointing (by setting checkpoint_interval > epochs).

If checkpointing is necessary, this may be mitigated with a very high checkpoint_interval; suggest checkpoint_interval > 100.

  • Check the impact of checkpoint_interval: 100
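The relationship between checkpoint_interval and the number of synchronous checkpoint writes a run performs can be sketched as follows (a hypothetical helper, not part of the training code):

```python
def effective_checkpoints(epochs, checkpoint_interval):
    """Number of synchronous checkpoints a run will perform.

    Setting checkpoint_interval > epochs makes this zero, i.e. it
    disables checkpointing entirely, which is the suggested workaround.
    """
    return epochs // checkpoint_interval

# effective_checkpoints(100, 101) -> 0   (checkpointing disabled)
# effective_checkpoints(100, 100) -> 1   (single checkpoint at the end)
# effective_checkpoints(100, 1)   -> 100 (checkpoint every epoch)
```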
