Conversation
Commits:
- …line in te amp cm
- Te fp8 wrapper
- …ant_mode argument name in python API
- Merge to main
TensorModelParallelArgs.tensor_model_parallel_size = 2
# MixedPrecisionArgs.mixed_precision_dtype = "fp8"
Uncommenting this line should trigger an error at iteration 64 related to NaN loss.
Reproduce command:
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 2 --master_addr localhost --master_port 6000 pretrain_gr_ranking.py --gin-config-file movielens_ranking_fp8.gin
target_group_size=self._target_group_size,
)
# TODO: Remove this once the attention kernel outputs consistent dtype
@JacoCheung could you double-check why we need this?
When training with fp8 enabled, the model weights can be bf16/fp16 (NetworkArgs.dtype_str), usually bf16, and the activations follow the same dtype. But the HSTU kernel output is fp16, so we need a cast from fp16 to bf16 here. @shijieliu
@esoba do you think the cast needs to move into the kernel, or not?
I think casting fp16 to bf16 wouldn't result in an error, since the dynamic range is larger, but I'd assume some additional quantization error pops up (ideally the model learns around it anyway). For consistency it would probably be better to move the cast into the kernel, but as a workaround casting outside should be fine.
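For reference, a minimal sketch of the outside-the-kernel workaround being discussed, assuming a dense attention output tensor and a bf16 network dtype; the helper name is illustrative, not the repository's API:

```python
import torch

def cast_attn_output(attn_out: torch.Tensor,
                     model_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # The HSTU attention kernel currently emits fp16 regardless of
    # NetworkArgs.dtype_str, so cast back to the activation dtype here.
    # fp16 -> bf16 loses mantissa bits but cannot overflow, since bf16
    # has the wider dynamic range.
    if attn_out.dtype != model_dtype:
        attn_out = attn_out.to(model_dtype)
    return attn_out
```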
I will help review the code.
return jd

def _align_jagged_data_for_fp8(
Hi @esoba, since you pad here, there should be a discarding step in the postprocessor before the loss computation. That is to say:
final_loss = drop_pad_values(final_loss)
final_loss.mean().backward()
Otherwise the padded tokens will affect the backward pass, both the data gradients and the weight gradients, even if the padded values are initialized to 0.
See our loss calculation.
And our post-processor (if it's ranking) will treat the padded tokens as normal tokens.
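A minimal sketch of what the drop_pad_values step above could look like, assuming a boolean mask that marks the real (non-padded) token positions; the name and signature are illustrative, not the repository's actual API:

```python
import torch

def drop_pad_values(per_token_loss: torch.Tensor,
                    valid_mask: torch.Tensor) -> torch.Tensor:
    # Keep only losses at real token positions. Without this, the
    # zero-initialized padded tokens still enter .mean() and therefore
    # contribute to both data and weight gradients in backward.
    return per_token_loss[valid_mask]
```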
I believe I was seeing the issue when I set this to truncate (cut off the last N elements to get to the nearest length divisible by 16); let me double-check whether there is any undefined behavior doing it that way as well. Thanks for the catch!
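For context, a minimal sketch of the padding variant of this alignment, assuming a dense [T, D] activation rather than the actual jagged layout; it pads the token dimension up to a multiple of 16 (the usual fp8 GEMM shape requirement) and returns a mask so the padded positions can be dropped before the loss, as discussed above. All names are hypothetical:

```python
import torch
import torch.nn.functional as F

def pad_tokens_for_fp8(x: torch.Tensor, multiple: int = 16):
    """Zero-pad the token dim of a [T, D] tensor to a multiple of 16
    and return a boolean mask marking the real tokens."""
    pad = (-x.shape[0]) % multiple
    mask = torch.ones(x.shape[0] + pad, dtype=torch.bool, device=x.device)
    if pad:
        x = F.pad(x, (0, 0, 0, pad))  # pad rows (tokens), not features
        mask[-pad:] = False
    return x, mask
```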
FBGEMM_GPU HSTU has been integrated into recsys-example #321. FYI.
Added MixedPrecisionArgs to allow users to configure FP8 usage in TE linear layers and HSTU attention for the Native HSTU layer. Features include:
Minimal working example:
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 2 --master_addr localhost --master_port 6000 pretrain_gr_ranking.py --gin-config-file movielens_ranking_fp8.gin
The setup currently has a bug when both the TE linear layer and HSTU attention are fp8-enabled: NaN loss appears at iteration 64. I have a debugging branch here that tracks the forward pass and the associated fp8 metadata for easier debugging. I tried to repro the issue here with some dummy inputs, and it ran successfully; my hunch is that NaN gradients are flowing back into the embedding table.
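A minimal sketch of one way to test that hunch, assuming a standard torch.nn.Module embedding table; it registers gradient hooks that raise as soon as a NaN reaches the table's parameters, so the failing iteration is caught immediately. The helper is hypothetical, not part of the repository:

```python
import torch

def watch_for_nan_grads(module: torch.nn.Module, name: str = "embedding") -> None:
    # Register a per-parameter backward hook that raises on the first
    # NaN gradient flowing into this module.
    def make_hook(pname: str):
        def hook(grad: torch.Tensor) -> torch.Tensor:
            if torch.isnan(grad).any():
                raise RuntimeError(f"NaN gradient in {name}.{pname}")
            return grad
        return hook
    for pname, p in module.named_parameters():
        p.register_hook(make_hook(pname))
```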
Checklist