
FSDP2 support for activation checkpointing #359


Merged: le1nux merged 73 commits into main from fsdp2_activation_checkpointing on Jul 22, 2025

Conversation

@le1nux (Member) commented Apr 18, 2025

What does this PR do?

This PR adds activation checkpointing (AC) support for FSDP2.
There are now three AC variants (a minimal sketch of all three follows this list):

  • Full AC (same as before: entire modules are ACed, yielding the largest memory footprint reduction)
  • Selective layer AC (only every nth layer or module is ACed)
  • Selective op AC (only certain ops, typically low-memory but compute-intensive ones, are checkpointed)
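
A minimal sketch of the first two variants, assuming PyTorch's checkpoint_wrapper utilities; the helper names and the block_cls parameter are illustrative and not the PR's actual API:

    import torch.nn as nn
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )

    def apply_full_ac(model: nn.Module, block_cls: type) -> None:
        # Full AC: checkpoint every block -> largest memory footprint reduction.
        apply_activation_checkpointing(
            model,
            checkpoint_wrapper_fn=checkpoint_wrapper,
            check_fn=lambda m: isinstance(m, block_cls),
        )

    def apply_selective_layer_ac(model: nn.Module, block_cls: type, every_nth: int) -> None:
        # Selective layer AC: checkpoint only every nth block.
        counter = {"n": 0}

        def check_fn(m: nn.Module) -> bool:
            if not isinstance(m, block_cls):
                return False
            counter["n"] += 1
            return counter["n"] % every_nth == 0

        apply_activation_checkpointing(
            model, checkpoint_wrapper_fn=checkpoint_wrapper, check_fn=check_fn
        )

Selective op AC instead supplies a custom context_fn to the checkpoint wrapper so that only selected ops are saved (see the diff excerpts further down in this conversation).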

Additionally

  • Minor restructurings of the code
  • new results subscriber variant EvaluationResultToDiscSubscriber (will be used in benchmark tooling)
  • new class RandomDatasetBatchGenerator (will be used in the profiler)
  • changes in get_compiled_model: We now check that a module to be compiled has exactly one parent module referencing it and raise an exception otherwise. Previously, we replaced the compiled module in only one of the parents and silently skipped the others (see the sketch after this list).
  • changes in experiment_id generation: Previously, get_experiment_id_from_config(...) contained both local experiment ID generation and syncing. I refactored it so that we can now also sync arbitrary strings.
  • Originally this PR also contained profiling and benchmarking tooling. Since this is not production-ready yet, I moved it to https://github.com/Modalities/modalities/tree/legacy_profiling_env
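
As a hedged illustration of the new parent check in get_compiled_model, the helper below is hypothetical, not the PR's actual code:

    import torch.nn as nn

    def find_parent_modules(root: nn.Module, target: nn.Module) -> list[nn.Module]:
        # Collect every module in `root` that directly references `target` as a child.
        parents = []
        for module in root.modules():
            for child in module.children():
                if child is target:
                    parents.append(module)
        return parents

    # Sketch of the check: compilation replaces the module in its parent, so more than
    # one parent would leave stale references behind.
    # parents = find_parent_modules(model, module_to_compile)
    # if len(parents) != 1:
    #     raise ValueError(f"Expected exactly one parent module, found {len(parents)}.")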

Checklist before submitting final PR

  • [ ] My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

le1nux added 30 commits April 18, 2025 19:01
…w for testing with a distributed environment
@le1nux le1nux requested review from flxst and rrutmann July 17, 2025 11:15
@le1nux le1nux marked this pull request as ready for review July 17, 2025 11:20
@flxst flxst mentioned this pull request Jul 18, 2025
if config_file_path is None:
    experiment_id = f"{date_of_run}"
else:
    hash = hashlib.sha256(str(config_file_path).encode()).hexdigest()[:hash_length]
Collaborator:

Why do we hash the path of the config instead of the content of the file? The latter would yield the same experiment ID for identical configs

@le1nux (Member, Author) Jul 22, 2025:

That's a valid point. We can address this later on; previously, we did not hash the file content either.
#388
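
For reference, a minimal sketch of the content-based hashing suggested above (illustrative only, not what this PR implements; see #388 for the follow-up):

    import hashlib
    from pathlib import Path

    def get_content_hash(config_file_path: Path, hash_length: int = 8) -> str:
        # Hash the config file's bytes so identical configs yield the same experiment ID.
        content = Path(config_file_path).read_bytes()
        return hashlib.sha256(content).hexdigest()[:hash_length]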

raise NotImplementedError


class RandomDatasetBatchGenerator(DatasetBatchGeneratorIF):
Collaborator:

In which case do we need this? What is the advantage over using our non-random test data?

Member:

I also wonder why this module was added. The class does not seem to be used anywhere.


def _selective_checkpointing_context_fn():
    meta = defaultdict(int)
    save_ops_set = {ActivationCheckpointing.SAVE_DICT[key] for key in save_ops_keys}
Collaborator:

This throws an error for operations that are not listed in ActivationCheckpointing.SAVE_DICT. Why do we restrict to the ops in ActivationCheckpointing.SAVE_DICT?

Member (Author):

In the config we can only define strings, so we need to map each string to the respective function. In theory we could do something with eval(), but I find that rather ugly and error-prone. I would suggest we run the benchmarking w.r.t. SAC and then check whether we need to make this more generic.
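
Roughly, the mapping looks like the following sketch; the entries are illustrative, the actual keys and ops live in ActivationCheckpointing.SAVE_DICT:

    import torch

    # Config strings -> torch ops, resolved without eval().
    SAVE_DICT = {
        "torch.ops.aten.mm.default": torch.ops.aten.mm.default,
        "torch.ops.aten._scaled_dot_product_flash_attention.default": (
            torch.ops.aten._scaled_dot_product_flash_attention.default
        ),
    }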

mm_count_key = f"{mode}_mm_count"
if func == torch.ops.aten.mm.default:
    meta[mm_count_key] += 1
# Saves output of all compute ops in save_ops_set, except every second mm
Collaborator:

Why only every second?

Member (Author):

I followed the setup in torchtitan: https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/infra/parallelize.py#L301

From my understanding, it is about balancing compute cost against memory savings. If we wanted to make this completely configurable, we would have to store a checkpointing frequency for every op.
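
For context, a sketch of the torchtitan-style policy, assuming PyTorch's create_selective_checkpoint_contexts / CheckpointPolicy API from torch.utils.checkpoint (simplified, not the PR's exact code):

    from collections import defaultdict

    import torch
    from torch.utils.checkpoint import CheckpointPolicy, create_selective_checkpoint_contexts

    def make_selective_checkpointing_context_fn(save_ops_set):
        def _selective_checkpointing_context_fn():
            meta = defaultdict(int)

            def policy(ctx, func, *args, **kwargs):
                mode = "recompute" if ctx.is_recompute else "forward"
                if func == torch.ops.aten.mm.default:
                    meta[f"{mode}_mm_count"] += 1
                # Save outputs of ops in save_ops_set, except every second mm,
                # to balance recompute cost against memory savings.
                to_save = func in save_ops_set and not (
                    func == torch.ops.aten.mm.default and meta[f"{mode}_mm_count"] % 2 == 0
                )
                return CheckpointPolicy.MUST_SAVE if to_save else CheckpointPolicy.PREFER_RECOMPUTE

            return create_selective_checkpoint_contexts(policy)

        return _selective_checkpointing_context_fn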

@flxst (Member) left a comment:

Great work! I left a few comments.

Generally, it is a bit hard to understand the various changes that are seemingly unrelated to AC:

  • new results subscriber variant EvaluationResultToDiscSubscriber
  • new class RandomDatasetBatchGenerator
  • changes in get_compiled_model
  • changes in experiment_id generation

I think it would be good to at least list them explicitly in the PR description (instead of "Minor restructurings of the code") and indicate their purpose.

        | ActivationCheckpointedModelConfig.SelectiveOpACParams
    ),
) -> nn.Module:
    """FSDP2 variant for applying activation checkpointing to the given model (in-place operation).
Member:

Suggested change:
- """FSDP2 variant for applying activation checkpointing to the given model (in-place operation).
+ """General variant for applying activation checkpointing to the given model (in-place operation).

…since it can be used in the absence of FSDP2, directly on nn.Module classes (as the name of the method also indicates).

Member (Author):

I would say that since we always use either FSDP1 or FSDP2 and don't support training a model without one of these parallelizations, we should be a bit more restrictive here. Otherwise, the user might get the wrong idea: in theory it could be possible, but in practice it is not.

Member:

Ok, I get your point. I wonder if we should put "FSDP2" in the function name (like we do with FSDP1) to emphasize this, despite the model being an nn.Module. This might be slightly less confusing.

Member (Author):

done 👍

        (22310, 2, "config_activation_checkpointing_fsdp1_legacy.yaml"),
    ],
)
def test_full_activation_checkpointing_FSDP1_legacy(world_size: int, rdvz_port: int, relative_config_path: str):
Member:

Good question. I think the point is that modalities with FSDP1 is stable and has successfully been used for model training in practice. FSDP2, in contrast, requires some additional work (like this PR, or #374). Once the work is done and modalities with FSDP2 has proven to be stable and reliable in practice, we will probably drop support for FSDP1 after a certain grace period.


@le1nux le1nux requested review from rrutmann and flxst July 22, 2025 10:28
@flxst (Member) left a comment:

The first three tests in tests/training/test_activation_checkpointing.py require 2 GPUs, but they are not skipped if only a single GPU is available.

This makes the GitHub Actions tests fail:
https://github.com/Modalities/modalities/actions/runs/16443411219/job/46469270473

Locally, the tests also fail with CUDA_VISIBLE_DEVICES=0, while they run through with CUDA_VISIBLE_DEVICES=0,1.
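
A minimal sketch of one way to guard such tests (illustrative only; not necessarily the fix applied in this PR):

    import pytest
    import torch

    @pytest.mark.skipif(
        torch.cuda.device_count() < 2,
        reason="Requires at least 2 GPUs for the distributed activation checkpointing tests.",
    )
    def test_full_activation_checkpointing_fsdp2():
        ...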

@flxst flxst self-requested a review July 22, 2025 12:52
@flxst (Member) left a comment:

LGTM

@rrutmann (Collaborator) left a comment:

LGTM

@le1nux le1nux merged commit f6f663b into main Jul 22, 2025
7 checks passed
@le1nux le1nux deleted the fsdp2_activation_checkpointing branch July 22, 2025 14:19