
Optimize SD3 Pipeline: Padding Prompt Embeddings for softmax_hf8 Compatibility and Efficient Utilization #1816

Status: Open — wants to merge 1 commit into base: main

Conversation

@deepak-gowda-narayana (Contributor) commented Mar 4, 2025

What does this PR do?

The SD3 pipeline was running softmax kernels with inconsistent tensor shapes, leading to low utilization, as observed in the perf lib logs below:

The Geometry of the input and the kernel geometry are inconsistent in softmax_hf8 on account of which the kernel utilization is less than 50 percent and equals to 25.000000 percent

Optimization:

This PR addresses the low utilization by padding the prompt embeddings to a shape compatible with the softmax_hf8 kernel, leading to better utilization and faster execution of the diffusion attention computation.

  • Padding the prompt embeddings gave a performance increase of 21%.

The table below compares default execution, which uses the regular prompt embedding length of 333, against the padded version using length 384.

| Batch size | Inference steps | Embedding size | No. of images | Batches | Gaudi throughput (samples/sec) | Time per image (sec) |
|---|---|---|---|---|---|---|
| 1 | 40 | 384 (padded) | 5 | 5 | 0.067 | 14.93 |
| 1 | 40 | 333 (default) | 5 | 5 | 0.055 | 18.18 |
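As a rough illustration (not the exact diff in this PR), the padded length is consistent with rounding the sequence length up to the next multiple of 128, which maps 333 to 384; whether 128 is the precise alignment softmax_hf8 requires is an assumption here. A minimal sketch:

```python
import math

def padded_seq_len(seq_len: int, multiple: int = 128) -> int:
    """Round a sequence length up to the nearest multiple, e.g. 333 -> 384.

    The choice of 128 as the alignment is an assumption for illustration;
    it matches the 333 -> 384 padding reported in this PR.
    """
    return math.ceil(seq_len / multiple) * multiple

# With PyTorch, the prompt embeddings could then be zero-padded along the
# sequence dimension (dim 1 of a [batch, seq, hidden] tensor), e.g.:
#   target = padded_seq_len(prompt_embeds.shape[1])
#   prompt_embeds = torch.nn.functional.pad(
#       prompt_embeds, (0, 0, 0, target - prompt_embeds.shape[1])
#   )

print(padded_seq_len(333))  # -> 384
```

Note that the table's per-image times follow directly from the throughput: 1 / 0.067 ≈ 14.93 s and 1 / 0.055 ≈ 18.18 s.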

Observations from the Profile Trace

The performance gain comes from the reshaping reducing the number of softmax kernel invocations to roughly one-third of the default execution. The softmax kernel statistics for the default and optimized cases are summarized below.

| Type of execution | Op name | Num nodes | Avg single engine active (us) |
|---|---|---|---|
| Default | softmax_fwd_hf8 | 14440 | 145074.07 |
| Default (iter 2) | softmax_fwd_hf8 | 14440 | 145071.953 |
| Optimized | softmax_fwd_hf8 | 8664 | 145795.544 |
| Optimized | softmax_fwd_bf16 | 32 | 1308.537 |
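A quick sanity check of the "one-third" claim from the node counts reported above (summing both default iterations and both optimized ops):

```python
# Softmax kernel invocation counts taken from the profile-trace table above.
default_nodes = 14440 + 14440   # softmax_fwd_hf8, iterations 1 and 2
optimized_nodes = 8664 + 32     # softmax_fwd_hf8 + softmax_fwd_bf16

ratio = optimized_nodes / default_nodes
print(f"{ratio:.2f}")  # ~0.30, i.e. roughly one-third of the default count
```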

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@deepak-gowda-narayana (Author):

Output from `make style`:

```
ruff check . setup.py --fix
Found 1 error (1 fixed, 0 remaining).
ruff format . setup.py
1 file reformatted, 430 files left unchanged
```

@deepak-gowda-narayana (Author):

@dsocek Please review the PR.

@dsocek (Contributor) left a comment:

LGTM
