
Conversation

@ruisizhang123 (Contributor) commented Nov 20, 2025:

As discussed, we will add an additional CI to guard compiler passes' composability.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 20, 2025
@ruisizhang123 ruisizhang123 marked this pull request as draft November 20, 2025 23:24
@ruisizhang123 ruisizhang123 marked this pull request as ready for review November 25, 2025 20:49
@ruisizhang123 ruisizhang123 changed the title [WIP][SimpleFSDP] add CI to guard compiler optimization passes [SimpleFSDP] add CI to guard compiler optimization passes Nov 25, 2025
# Run frontend integration tests of SimpleFSDP
python -m torchtitan.experiments.simple_fsdp.tests.frontend_integration_tests artifacts-to-be-uploaded --ngpu 8
# Run backend pass integration tests of SimpleFSDP
python -m torchtitan.experiments.simple_fsdp.tests.compiler_pass_integration_tests artifacts-to-be-uploaded --ngpu 8 --comm_mode local_tensor
Contributor Author:

I also tried FakeBackend mode, but the memory overhead is significantly higher @fegin 🤔 (~33GiB with local_tensor -> ~90GiB with FakeBackend). My suspicion is that FakeBackend initializes the whole model on one rank?

Contributor:

Actually, it is the reverse: for FakeBackend mode, memory should be lower. On the other hand, local_tensor allocates all tensors in the same process, so it should be higher. cc @dzmitry-huba

Contributor Author:

Okay, not sure which part is wrong. Here is an easy repro. The memory I reported in the previous message is from the compiler CI test.

1. NGPU=4 COMM_MODE="fake_backend" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh

   [titan] 2025-11-25 18:27:58,478 - root - INFO - step:  1  loss: 12.2713  grad_norm:  0.0000  memory: 47.60GiB(50.11%)  tps: 1,112  tflops: 64.41  mfu: 6.51%

2. NGPU=4 COMM_MODE="local_tensor" CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh

   Since there is no actual training, the peak memory is ~31GiB from nvidia-smi.

Contributor:

Uh, I see. This is because we currently skip the real training for LocalTensor, since LocalTensor doesn't support FSDP2. But it should work with SimpleFSDP.

Contributor Author:

Hmmm, it is also skipped in SimpleFSDP. Not sure why, but fake_backend gives a huge memory overhead in SimpleFSDP's compiler pass CI (~90GiB).

I think I can either verify with RealBackend (more memory overhead than LocalTensor, but less than fake backend), or use LocalTensor mode (less memory overhead, but it doesn't execute actual training).

Curious: which parts does LocalTensor mode execute? If it runs compilation but skips the actual training, that looks sufficient for our case, since we are just verifying the compiler passes' integration?

Contributor:

Not sure if splitting into two tests will incur overhead.
@wwwjn does this incur any overhead to CI?


for idx, override_arg in enumerate(test_flavor.override_args):
    cmd = f"CONFIG_FILE={full_path} NGPU={test_flavor.ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode == "default":
Contributor:

comm_mode probably should be put in OverrideDefinitions?
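For reference, a minimal sketch of what carrying comm_mode inside the flavor definition could look like. This is an illustration only: apart from override_args and ngpu, which appear in the snippet above, the field names and defaults here are assumptions, not torchtitan's actual OverrideDefinitions layout.

```python
from dataclasses import dataclass, field

@dataclass
class OverrideDefinitions:
    """Hypothetical per-flavor test definition carrying its own comm mode."""

    override_args: list = field(default_factory=list)  # per-test CLI overrides
    ngpu: int = 4                                      # GPUs requested by the flavor
    comm_mode: str = "default"                         # "default", "fake_backend", or "local_tensor"


# Each flavor then declares how it should be launched, instead of the
# runner threading a separate comm_mode argument through every call site.
flavor = OverrideDefinitions(ngpu=8, comm_mode="local_tensor")
print(flavor.comm_mode)
```

With this shape, the launch loop can branch on `test_flavor.comm_mode` rather than a global `comm_mode` parameter.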

Contributor Author (@ruisizhang123), Nov 25, 2025:

No, fake_tensor & local_tensor modes use python -m instead of torchrun, so we'd need to update how cmd launches ./run_train.sh here. We could add a COMM_MODE prefix in front of ./run_train.sh, though.
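A minimal sketch of the COMM_MODE-prefix idea, building on the cmd construction shown in the diff above. The helper name and the exact run_train.sh contract (that it reads COMM_MODE from the environment to pick python -m vs torchrun) are assumptions for illustration.

```python
def build_cmd(full_path: str, ngpu: int, all_ranks: str, comm_mode: str) -> str:
    # Base launch command, mirroring the snippet in the diff above.
    cmd = f"CONFIG_FILE={full_path} NGPU={ngpu} LOG_RANK={all_ranks} ./run_train.sh"
    if comm_mode != "default":
        # Prepend COMM_MODE so run_train.sh can decide how to launch
        # (hypothetical: the script would branch on this env var).
        cmd = f"COMM_MODE={comm_mode} {cmd}"
    return cmd


print(build_cmd("llama3_8b.toml", 8, "0", "local_tensor"))
```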

Contributor:

You can still use fake_tensor with torchrun. The reason I don't use torchrun is that for the dry run case you don't actually need it; directly invoking the module saves us time.

Contributor Author (@ruisizhang123), Nov 26, 2025:

Hmmm, I guess you mean fake_backend? Sure, we can use fake_backend mode with torchrun. However, in that case every rank would spawn a process and run in parallel. What would be the difference between fake_backend mode and a real backend mode then?

In CI testing, I thought we used fake_backend because we wanted a dry run to verify things with fewer GPUs?

Contributor:

Yes, I'm just saying that there are cases where you need torchrun to verify (e.g., MPMD). But for SimpleFSDP, I don't think that is required.

