Optimize SD3 Pipeline: Padding Prompt Embeddings for softmax_hf8 Compatibility and Efficient Utilization #1816
What does this PR do?
The SD3 pipeline was running softmax kernels with inconsistent tensor shapes, leading to low utilization. This was observed in the perf lib logs, as shown below:
The Geometry of the input and the kernel geometry are inconsistent in softmax_hf8 on account of which the kernel utilization is less than 50 percent and equals to 25.000000 percent
Optimization:
This PR addresses the low utilization by padding the prompt embeddings to a shape compatible with the softmax_hf8 kernels, which improves utilization and speeds up the diffusion attention process.
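As a rough illustration of the padding idea, here is a minimal pure-Python sketch. It is not the code from this PR: the helper names `padded_length` and `pad_prompt_embeds` are hypothetical, the alignment value is an assumption (both 64 and 128 round a length of 333 up to 384), and zero-filling the extra positions is also an assumption. The real pipeline would operate on tensors rather than nested lists.

```python
def padded_length(seq_len: int, multiple: int) -> int:
    """Round seq_len up to the nearest multiple of `multiple`."""
    if seq_len < 0 or multiple <= 0:
        raise ValueError("seq_len must be >= 0 and multiple must be > 0")
    # Ceiling division without floats, then scale back up.
    return -(-seq_len // multiple) * multiple


def pad_prompt_embeds(embeds, multiple):
    """Append zero vectors along the sequence axis (hypothetical helper).

    embeds: list of token vectors, i.e. seq_len x hidden_dim.
    The zero fill value is an assumption, not taken from the PR.
    """
    if not embeds:
        return []
    hidden_dim = len(embeds[0])
    target_len = padded_length(len(embeds), multiple)
    padding = [[0.0] * hidden_dim for _ in range(target_len - len(embeds))]
    # The original token vectors are kept unchanged; only the tail is padded.
    return embeds + padding


# Toy embeddings: 333 tokens with a hidden size of 4 (illustrative only).
toy_embeds = [[1.0] * 4 for _ in range(333)]
padded = pad_prompt_embeds(toy_embeds, multiple=128)
print(len(toy_embeds), "->", len(padded))
```

With an assumed alignment of 128, the sequence length grows from 333 to 384, matching the two lengths discussed in this PR. A sequence whose length is already aligned is returned without extra rows.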
The table below compares the performance of default execution, which uses the regular prompt embedding shape of 333, with the padded version, which uses 384.
Observation from Profile Trace
The performance gain comes from the tensor reshaping, which reduces the number of softmax kernels to one-third of the number used in default execution.
The softmax kernel information is summarized below for both the default and optimized cases.
Before submitting