Explore new pipeline that overlaps optimizer with emb_lookup #2916


Closed

Conversation

TroyGarden (Contributor)

Summary:

# context
* this workstream started from a training QPS optimization initiated on the PG side (see the doc in the reference section), observing that the embedding lookup can overlap with the optimizer.
* Embedding table weights are updated in the fused backward (fused-TBE), so the embedding lookup can start immediately after the backward completes, with no dependency on the optimizer (see the sketch after this section).
* we initially used a separate stream to run the embedding lookup so that it could overlap with the previous optimizer step (changed, see below).
* there is also an option to use the data_dist stream for this embedding lookup: the output_dist won't be blocked, but the start_sparse_data_dist would be, which results in a smaller memory footprint.

WARNING: This pipeline **DOES NOT** work for EBC/EC with feature processors, because the embedding lookup is started immediately after the TBE backward (at which point the embedding tables' weights have already been updated).
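For intuition, a minimal sketch of the core idea, assuming plain PyTorch CUDA streams; this is not the TorchRec implementation, and the model, stream handling, and names (`emb`, `dense`, `lookup_stream`) are all illustrative:

```python
import torch

assert torch.cuda.is_available()

# Stand-ins for the sharded embedding tables and the dense part of the model.
emb = torch.nn.EmbeddingBag(1000, 64, mode="sum").cuda()
dense = torch.nn.Linear(64, 1).cuda()
opt = torch.optim.SGD(dense.parameters(), lr=0.1)

lookup_stream = torch.cuda.Stream()  # side stream for the embedding lookup
next_ids = torch.randint(0, 1000, (32, 8), device="cuda")

# ... forward + backward for the current batch happen here; with fused-TBE the
# embedding weights are already updated when backward returns, so the next
# batch's lookup depends only on the backward, not on the dense optimizer step.

lookup_stream.wait_stream(torch.cuda.current_stream())  # order after backward
with torch.cuda.stream(lookup_stream):
    pooled = emb(next_ids)  # can overlap with the optimizer kernels below

opt.step()       # dense optimizer runs on the default stream in parallel
opt.zero_grad()

# Consumers on the default stream must sync with the side stream first.
torch.cuda.current_stream().wait_stream(lookup_stream)
pooled.record_stream(torch.cuda.current_stream())
```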

# benchmark readings
* runtime: SemiSync < FusedSparseDist (lookup after opt) < FusedSparseDist (lookup before opt) < SparseDist
```
TrainPipelineSemiSync         | Runtime (P90): 5447.42 ms | Peak Memory alloc (P90): 61.63 GB | Peak Memory reserved (P90): 64.31 GB
TrainPipelineFusedSparseDist  | Runtime (P90): 5605.63 ms | Peak Memory alloc (P90): 53.23 GB | Peak Memory reserved (P90): 68.61 GB
TrainPipelineFusedSparseDist* | Runtime (P90): 5661.92 ms | Peak Memory alloc (P90): 53.23 GB | Peak Memory reserved (P90): 68.67 GB
TrainPipelineSparseDist       | Runtime (P90): 6034.46 ms | Peak Memory alloc (P90): 51.80 GB | Peak Memory reserved (P90): 62.25 GB
* embedding_lookup_after_opt = False
```
* traces show that:
(1) the emb_lookup runs right behind the TBE backward (on the same CUDA stream)
(2) the output_dist is invoked right after each emb_lookup (there are two: one for the unweighted EBC, one for the weighted)
(3) the optimizer seems **NOT** to overlap with the emb_lookup kernel when `embedding_lookup_after_opt = False`
{F1977309185}
(4) the optimizer still does **NOT** overlap with the emb_lookup kernel, but it fills the gap between `KJTTensorAwaitable.wait()` and the embedding lookup kernel when `embedding_lookup_after_opt = True`
{F1977309202}
(5) using a separate stream for the embedding lookup lets the following `start_sparse_data_dist` start immediately; however, this causes extra memory consumption
{F1977366363}
(6) reusing the data_dist stream for the embedding lookup makes the following `start_sparse_data_dist` wait for the embedding lookup to complete; the measured memory footprint is smaller
{F1977366349}
NOTE: based on (5) and (6), we make `use_emb_lookup_stream = False` the default behavior; a sketch of this choice follows.
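The tradeoff between (5) and (6) can be pictured with a hedged sketch; `data_dist_stream` and the helper below are illustrative assumptions, not TorchRec attributes:

```python
import torch

assert torch.cuda.is_available()

use_emb_lookup_stream = False  # default chosen per (5)/(6) above

# Illustrative: the stream the pipeline already uses for sparse data dist.
data_dist_stream = torch.cuda.Stream()
# Option (5): a dedicated lookup stream lets the next start_sparse_data_dist on
# data_dist_stream begin immediately, at the cost of extra cached memory.
# Option (6): reusing data_dist_stream serializes the next start_sparse_data_dist
# behind the lookup, keeping the memory footprint smaller.
emb_lookup_stream = torch.cuda.Stream() if use_emb_lookup_stream else data_dist_stream

def run_emb_lookup(table: torch.nn.EmbeddingBag, ids: torch.Tensor) -> torch.Tensor:
    # Order the lookup after whatever the default stream has issued (e.g. backward),
    # then launch it on the chosen stream.
    emb_lookup_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(emb_lookup_stream):
        return table(ids)
```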

# conclusions
* Based on a simple model (SparseNN), both the "Fused Sparse Dist" and the "Semi Sync" pipelines are faster than the current default (commonly used) "Sparse Dist" pipeline: about -7% (fused sparse dist) and -10% (semi sync) in runtime (verified below).
* In a more realistic scenario, the optimizer step has a longer runtime footprint, which can amplify this optimization.
* The "Semi Sync" pipeline has the larger QPS win but produces slightly different numerical training results, while the "Fused Sparse Dist" pipeline, with a smaller QPS win, should be numerically identical to the default pipeline.
* Which one to use is the user's choice.
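The quoted percentages follow directly from the P90 runtimes in the benchmark table above:

```python
base = 6034.46   # TrainPipelineSparseDist, P90 runtime (ms)
fused = 5605.63  # TrainPipelineFusedSparseDist
semi = 5447.42   # TrainPipelineSemiSync

print(f"fused sparse dist: {(fused - base) / base:+.1%}")  # -7.1%
print(f"semi sync:         {(semi - base) / base:+.1%}")   # -9.7%
```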

# reference
* https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486

Differential Revision: D64479105

@facebook-github-bot added the "CLA Signed" label Apr 24, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64479105

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Apr 29, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
@TroyGarden deleted the export-D64479105 branch June 19, 2025 07:26
Labels: CLA Signed, fb-exported