Skip to content

refactor: move HSTU build to devel stage#325

Merged
JacoCheung merged 3 commits intoNVIDIA:mainfrom
shijieliu:fix-hstu_building
Mar 13, 2026
Merged

refactor: move HSTU build to devel stage#325
JacoCheung merged 3 commits intoNVIDIA:mainfrom
shijieliu:fix-hstu_building

Conversation

@shijieliu
Copy link
Collaborator

Move fbgemm_gpu_hstu compilation from the build stage into the devel (base) stage so it is baked into the base image. Also remove the MAX_JOBS=2 constraint.

Made-with: Cursor

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Move fbgemm_gpu_hstu compilation from the build stage into the devel
(base) stage so it is baked into the base image. Also remove the
MAX_JOBS=2 constraint.

Made-with: Cursor
@greptile-apps
Copy link

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR refactors the Docker build pipeline by moving the fbgemm_gpu_hstu compilation from the build stage into the devel (base) stage, so HSTU is baked into the published base image rather than recompiled on every application build. MAX_JOBS=2 is also removed to allow full-parallelism compilation, and the FBGEMM submodule is bumped to a newer commit.

Key changes:

  • COPY third_party/FBGEMM + RUN pip install . (HSTU build) relocated from builddevel stage
  • The redundant COPY . . / RUN rm -rf * pattern at the end of the old devel stage is removed
  • build stage simplified to just COPY . . + project-specific builds
  • FBGEMM submodule bumped from 04df536 to df29adb
  • MAX_JOBS=2 cap removed, allowing the compiler to use all available cores

Confidence Score: 3/5

  • Mostly safe to merge, but the devel stage now has a hard dependency on submodule initialization that should be documented, and the HSTU_ARCH_LIST / HSTU_DISABLE_86OR89 mismatch is worth resolving.
  • The structural refactoring is correct and achieves the stated goal. The two concerns — missing submodule-init documentation for the devel build target, and the HSTU_ARCH_LIST not including 8.6/8.9 despite HSTU_DISABLE_86OR89=FALSE — are moderate issues. Neither is a regression (the arch list mismatch existed before), but the submodule dependency is now more prominent since it blocks building the published base image.
  • docker/Dockerfile — verify submodule initialization is enforced in any CI/automation that builds the devel target, and consider aligning HSTU_ARCH_LIST with the disable flags.

Important Files Changed

Filename Overview
docker/Dockerfile Moves HSTU build from build stage into devel stage to bake it into the base image; removes MAX_JOBS=2; eliminates the redundant COPY . . / rm -rf pattern in devel. The restructuring is logically sound but introduces a hard dependency on submodule initialization when building the devel target, and the HSTU_ARCH_LIST / HSTU_DISABLE_86OR89 mismatch is carried forward.
third_party/FBGEMM Bumps the FBGEMM submodule pointer from 04df536 to df29adb. Intentional update to pick up HSTU changes needed by the devel-stage build; no code changes in this repo.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["FROM BASE_IMAGE AS devel"] --> B["Install system deps\n(apt, pip: megatron, fbgemm, torchrec…)"]
    B --> C["COPY third_party/FBGEMM\n(requires submodule init)"]
    C --> D["RUN pip install fbgemm_gpu_hstu\n(HSTU baked into devel image)"]
    D --> E["Push as DEVEL_IMAGE\n(base image with HSTU)"]

    E --> F["FROM DEVEL_IMAGE AS build"]
    F --> G["COPY . .\n(full repo incl. third_party/FBGEMM)"]
    G --> H["RUN build dynamicemb"]
    H --> I["RUN build commons"]
    I --> J["Final application image"]

    style C fill:#f9f,stroke:#333
    style D fill:#f9f,stroke:#333
    style E fill:#bbf,stroke:#333
Loading

Comments Outside Diff (1)

  1. docker/Dockerfile, line 68 (link)

    HSTU_ARCH_LIST does not cover Ampere 8.6 / Ada 8.9 despite HSTU_DISABLE_86OR89=FALSE

    HSTU_DISABLE_86OR89=FALSE signals that the HSTU code should compile kernels for SM 8.6 (A10/A40) and SM 8.9 (L4/L40) architectures, but HSTU_ARCH_LIST="8.0 9.0" does not include those SMs. Depending on how the build system consumes these two variables, the result could be that 8.6/8.9 kernels are silently skipped despite the intent to enable them.

    Consider aligning the arch list with the disable flag:

    This was the same in the removed build stage, but since this is now the canonical location for this build, it is worth fixing here.

Last reviewed commit: 08c0e0b

@JacoCheung JacoCheung self-requested a review March 12, 2026 11:53
…ache

The broad COPY . . before the HSTU build invalidated the compilation
layer on every repo change. Remove the unnecessary COPY . . and
rm -rf pair from the devel stage so only changes to third_party/FBGEMM
trigger an HSTU rebuild.

Made-with: Cursor
@JacoCheung JacoCheung merged commit 6657399 into NVIDIA:main Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants