Explore new pipeline that overlaps optimizer with emb_lookup #2916


Closed

Conversation

TroyGarden (Contributor)

Summary:

# context
* this workstream started from a training QPS optimization initiated on the PG side (see the doc in the reference section), observing that the embedding lookup can overlap with the optimizer.
* Embedding table weights are updated in the fused backward (fused-TBE), so the embedding lookup can start immediately after the backward completes, with no dependency on the optimizer (see the sketch after this section).
* we initially used a separate stream to run the embedding lookup so that it could overlap with the previous optimizer step (changed, see below).
* there is also an option to use the data_dist stream for this embedding lookup: the output_dist won't be blocked, but the start_sparse_data_dist would be, which results in a smaller memory footprint.

WARNING: This pipeline **DOES NOT** work for EBC/EC with feature processors, because the embedding lookup is started immediately after the TBE backward (at which point the embedding tables' weights have already been updated).
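For intuition, a minimal sketch of the core idea, assuming plain PyTorch CUDA streams; this is not the TorchRec implementation, and the model, stream handling, and names (`emb`, `dense`, `lookup_stream`) are all illustrative:

```python
import torch

assert torch.cuda.is_available()

# Stand-ins for the sharded embedding tables and the dense part of the model.
emb = torch.nn.EmbeddingBag(1000, 64, mode="sum").cuda()
dense = torch.nn.Linear(64, 1).cuda()
opt = torch.optim.SGD(dense.parameters(), lr=0.1)

lookup_stream = torch.cuda.Stream()  # side stream for the embedding lookup
next_ids = torch.randint(0, 1000, (32, 8), device="cuda")

# ... forward + backward for the current batch happen here; with fused-TBE the
# embedding weights are already updated when backward returns, so the next
# batch's lookup depends only on the backward, not on the dense optimizer step.

lookup_stream.wait_stream(torch.cuda.current_stream())  # order after backward
with torch.cuda.stream(lookup_stream):
    pooled = emb(next_ids)  # can overlap with the optimizer kernels below

opt.step()       # dense optimizer runs on the default stream in parallel
opt.zero_grad()

# Consumers on the default stream must sync with the side stream first.
torch.cuda.current_stream().wait_stream(lookup_stream)
pooled.record_stream(torch.cuda.current_stream())
```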

# benchmark readings
* runtime: SemiSync < FusedSparseDist (lookup after opt) < FusedSparseDist (lookup before opt) < SparseDist
```
TrainPipelineSemiSync         | Runtime (P90): 5447.42 ms | Peak Memory alloc (P90): 61.63 GB | Peak Memory reserved (P90): 64.31 GB
TrainPipelineFusedSparseDist  | Runtime (P90): 5605.63 ms | Peak Memory alloc (P90): 53.23 GB | Peak Memory reserved (P90): 68.61 GB
TrainPipelineFusedSparseDist* | Runtime (P90): 5661.92 ms | Peak Memory alloc (P90): 53.23 GB | Peak Memory reserved (P90): 68.67 GB
TrainPipelineSparseDist       | Runtime (P90): 6034.46 ms | Peak Memory alloc (P90): 51.80 GB | Peak Memory reserved (P90): 62.25 GB
* embedding_lookup_after_opt = False
```
* traces show that:
(1) the emb_lookup runs right behind the TBE backward (on the same CUDA stream)
(2) the output_dist is invoked right after each emb_lookup (there are two: one for the unweighted EBC, one for the weighted)
(3) the optimizer seems **NOT** to overlap with the emb_lookup kernel when `embedding_lookup_after_opt = False`
{F1977309185}
(4) the optimizer still does **NOT** overlap with the emb_lookup kernel, but it fills the gap between `KJTTensorAwaitable.wait()` and the embedding lookup kernel when `embedding_lookup_after_opt = True`
{F1977309202}
(5) using a separate stream for the embedding lookup lets the following `start_sparse_data_dist` start immediately; however, this causes extra memory consumption
{F1977366363}
(6) reusing the data_dist stream for the embedding lookup makes the following `start_sparse_data_dist` wait for the embedding lookup to complete; the measured memory footprint is smaller
{F1977366349}
NOTE: based on (5) and (6), we make `use_emb_lookup_stream = False` the default behavior; a sketch of this choice follows.
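The tradeoff between (5) and (6) can be pictured with a hedged sketch; `data_dist_stream` and the helper below are illustrative assumptions, not TorchRec attributes:

```python
import torch

assert torch.cuda.is_available()

use_emb_lookup_stream = False  # default chosen per (5)/(6) above

# Illustrative: the stream the pipeline already uses for sparse data dist.
data_dist_stream = torch.cuda.Stream()
# Option (5): a dedicated lookup stream lets the next start_sparse_data_dist on
# data_dist_stream begin immediately, at the cost of extra cached memory.
# Option (6): reusing data_dist_stream serializes the next start_sparse_data_dist
# behind the lookup, keeping the memory footprint smaller.
emb_lookup_stream = torch.cuda.Stream() if use_emb_lookup_stream else data_dist_stream

def run_emb_lookup(table: torch.nn.EmbeddingBag, ids: torch.Tensor) -> torch.Tensor:
    # Order the lookup after whatever the default stream has issued (e.g. backward),
    # then launch it on the chosen stream.
    emb_lookup_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(emb_lookup_stream):
        return table(ids)
```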

# conclusions
* Based on a simple model (SparseNN), both the "Fused Sparse Dist" and the "Semi Sync" pipelines are faster than the current default (commonly used) "Sparse Dist" pipeline: about -7% (fused sparse dist) and -10% (semi sync) in runtime (verified below).
* In a more realistic scenario, the optimizer step has a longer runtime footprint, which can amplify this optimization.
* The "Semi Sync" pipeline has the larger QPS win but produces slightly different numerical training results, while the "Fused Sparse Dist" pipeline, with a smaller QPS win, should be numerically identical to the default pipeline.
* Which one to use is the user's choice.
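The quoted percentages follow directly from the P90 runtimes in the benchmark table above:

```python
base = 6034.46   # TrainPipelineSparseDist, P90 runtime (ms)
fused = 5605.63  # TrainPipelineFusedSparseDist
semi = 5447.42   # TrainPipelineSemiSync

print(f"fused sparse dist: {(fused - base) / base:+.1%}")  # -7.1%
print(f"semi sync:         {(semi - base) / base:+.1%}")   # -9.7%
```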

# reference
* https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486

Differential Revision: D64479105

@facebook-github-bot added the "CLA Signed" label Apr 24, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D64479105

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Apr 29, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request May 1, 2025
@TroyGarden deleted the export-D64479105 branch June 19, 2025 07:26
Labels: CLA Signed, fb-exported