Optimize SD3 Pipeline : Padding prompt Embeddings for softmax_hf8 compatibility and Efficient Utilization #1816

deepak-gowda-narayana · 2025-03-04T22:46:22Z

What does this PR do?

The SD3 pipeline was running softmax kernels on tensor shapes that do not match the kernel geometry, which led to low utilization. This shows up in the perf lib logs, as in the message below:

The Geometry of the input and the kernel geometry are inconsistent in softmax_hf8 on account of which the kernel utilization is less than 50 percent and equals to 25.000000 percent

Optimization :

This PR addresses the low utilization by padding the prompt embeddings to a shape compatible with the softmax_hf8 kernels, which improves kernel utilization and speeds up the diffusion attention step.
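As a rough illustration of the padding arithmetic, the sketch below rounds a prompt embedding size up to the next multiple of an alignment value. The helper name `padded_length` and the alignment of 128 are assumptions made for illustration; the PR description only states the sizes 333 and 384, not how the target size was chosen.

```python
def padded_length(length: int, alignment: int = 128) -> int:
    """Round `length` up to the next multiple of `alignment`.

    Hypothetical helper: the default alignment of 128 is an assumption,
    not a value taken from the PR.
    """
    if length < 0 or alignment <= 0:
        raise ValueError("length must be >= 0 and alignment must be > 0")
    # Ceiling division, then scale back up to a multiple of the alignment.
    return -(-length // alignment) * alignment


default_len = 333  # prompt embedding size reported for default execution
target_len = padded_length(default_len)
pad_amount = target_len - default_len  # extra positions to append

print(target_len, pad_amount)
```

With an alignment of 128 (or 64), 333 rounds up to 384, which matches the padded size reported in this PR. On a PyTorch tensor laid out as (batch, sequence, hidden), one way to append that many zero positions along the sequence dimension would be `torch.nn.functional.pad(prompt_embeds, (0, 0, 0, pad_amount))`; the padding value and the tensor layout are assumptions, since the description does not state them.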

Padding the prompt embeddings increased throughput by about 21% (from 0.055 to 0.067 samples/sec).

The table below compares default execution, which uses the regular prompt embedding size of 333, with the padded version, which uses 384.

Batch size	Inference steps	Prompt embedding size	No. of images	Batches	Gaudi throughput (samples/sec)	Time per image (sec)
1	40	384 (padded)	5	5	0.067	14.93
1	40	333 (default)	5	5	0.055	18.18
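The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below takes the two throughput values from the table, derives the time per image as the reciprocal of throughput (valid here because the batch size is 1), and computes the relative change; the variable names are illustrative.

```python
# Throughput values (samples/sec) copied from the table above.
throughput_padded = 0.067   # embedding size 384
throughput_default = 0.055  # embedding size 333

# With batch size 1, time per image is the reciprocal of throughput.
time_padded = 1.0 / throughput_padded
time_default = 1.0 / throughput_default

# Relative throughput gain and relative reduction in time per image.
throughput_gain = throughput_padded / throughput_default - 1.0
time_reduction = 1.0 - time_padded / time_default

print(f"time per image: {time_padded:.2f}s vs {time_default:.2f}s")
print(f"throughput gain: {throughput_gain:.1%}")
print(f"time reduction: {time_reduction:.1%}")
```

Note that the two percentages differ: the gain of roughly 21% quoted in this PR is measured on throughput, while the corresponding reduction in time per image is closer to 18%.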

Observation from Profile Trace

The gain comes from the reshaped tensors needing far fewer softmax kernel nodes: summing the node counts in the table below, the optimized run uses roughly one-third as many as the default run (8,696 vs. 28,880).
The softmax kernel statistics for the default and optimized cases are summarized below.

Type of execution	Op name	Num nodes	Avg. single-engine active time (us)
Default	softmax_fwd_hf8	14440	145074.07
Default	softmax_fwd_hf8 iter 2	14440	145071.953
Optimized	softmax_fwd_hf8	8664	145795.544
Optimized	softmax_fwd_bf16	32	1308.537
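The node counts in the table can be compared directly. The sketch below sums the softmax node counts per execution type and computes their ratio; treating the rows for one execution type as additive is an assumption about how the profile is organized, and the dictionary layout is illustrative.

```python
# Softmax node counts copied from the table above, keyed by execution type.
softmax_nodes = {
    "default": {"softmax_fwd_hf8": 14440, "softmax_fwd_hf8 iter 2": 14440},
    "optimized": {"softmax_fwd_hf8": 8664, "softmax_fwd_bf16": 32},
}

# Assumption: rows for the same execution type are additive.
totals = {kind: sum(ops.values()) for kind, ops in softmax_nodes.items()}
ratio = totals["optimized"] / totals["default"]

print(totals)
print(f"optimized / default softmax nodes: {ratio:.2f}")
```

Under that assumption the optimized run uses about 30% as many softmax nodes as the default run, in line with the one-third figure quoted in the observation.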

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?


deepak-gowda-narayana · 2025-03-04T22:47:51Z

Output from make style:

ruff check . setup.py --fix
Found 1 error (1 fixed, 0 remaining).
ruff format . setup.py
1 file reformatted, 430 files left unchanged

deepak-gowda-narayana · 2025-03-04T22:48:16Z

@dsocek Please review PR

dsocek

LGTM

Padding Text Embeddings for softmax_hf8 compatibility and Efficient Utilization

6ffb3d5

deepak-gowda-narayana requested a review from regisss as a code owner March 4, 2025 22:46

dsocek approved these changes Mar 4, 2025
