6 changes: 6 additions & 0 deletions .github/workflows/integration_test_8gpu_features.yaml
@@ -76,5 +76,11 @@ jobs:
export TEST_WITH_ROCM=$([[ "${{ matrix.gpu-arch-type }}" == "rocm" ]] && echo 1 || echo 0)
python -m tests.integration_tests.run_tests --test_suite features $RUNNER_TEMP/artifacts-to-be-uploaded --ngpu 8

# Verify the accuracy.
export baseline_options="--parallelism.data_parallel_replicate_degree=1"
export test_options="--parallelism.data_parallel_replicate_degree=4"
Contributor
Why compare these two? FSDP 8 vs. HSDP 4×2?

Also, we are not using the assert_equal flag, so what happens if the losses differ? Are you analyzing them with some other script?

I thought the idea was to compare main against the PR submission, but for that it may be easier to assert against a fixed loss number.

Contributor Author
Yes, I should add --assert_equal; I'll update it.

I originally planned to compare the current PR against the main branch. The problem is that CI mounts the folder as read-only, so I couldn't check out the main version. I took a step back and decided to verify a minimum guarantee instead: certain parallelism variations should produce identical losses. Since this check is quick, I think it doesn't hurt to add it (we can also verify TP this way, as in the sketch below). It enforces that minimum guarantee and also validates the comparison script itself.
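
A minimal sketch of what the TP variation could look like, reusing the same comparison pattern. This is an assumption, not the submitted change: --parallelism.tensor_parallel_degree is my guess at the relevant torchtitan option, and --assert_equal is assumed to take no argument.

```bash
# Sketch only: same 8-GPU world size, FSDP 8 as baseline vs. TP 2 as test.
# Assumes --parallelism.tensor_parallel_degree is the TP knob and that
# --assert_equal makes loss_compare.py fail on any per-step loss mismatch.
export baseline_options="--parallelism.data_parallel_replicate_degree=1"
export test_options="--parallelism.tensor_parallel_degree=2"
python3 scripts/loss_compare.py . . \
    --baseline-options="${baseline_options}" \
    --test-options="${test_options}" \
    --steps=10 --assert_equal
```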

For comparing against the main branch, one approach is to pre-record the results and add an option to the tool to read them; both the baseline and test runs would then have to match the pre-recorded losses when --assert_equal is set. The downside is that this is less stable than simply checking out main during CI, since the pre-recorded method can't tolerate machine or library changes.
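
A rough sketch of that pre-recorded flow, assuming hypothetical --record-losses and --assert-losses options that do not exist in the tool yet:

```bash
# Hypothetical flags, for illustration only.
# 1) One-time: record per-step losses on a trusted main build.
python3 scripts/loss_compare.py . . --steps=10 --record-losses=recorded_losses.json
# 2) In CI: both baseline and test runs must match the recording.
python3 scripts/loss_compare.py . . --steps=10 --assert_equal --assert-losses=recorded_losses.json
```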

I prefer to do this in another PR.

Contributor

> since the pre-recorded method can't tolerate machine or library changes.

Good point, but there are two cases in which numerics can change:

  1. Not an error: just expected differences between machines / libraries.
  2. A library introduces a bug that makes both main and the branch produce the same, but incorrect, losses across all parallelisms.

I guess we could say the latter should be caught by the upstream library?

Contributor Author

Sure, we can do that; I'll handle it in another PR.

python3 scripts/loss_compare.py . . --baseline-options="${baseline_options}" --test-options="${test_options}" --steps=10 --job-dump-folder="${RUNNER_TEMP}/artifacts-to-be-uploaded/accuracy_comparison_outputs"

# Clean up the checkpoints so that we don't waste network bandwidth and time.
rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*/checkpoint
rm -rf artifacts-to-be-uploaded/*/checkpoint