docs/source/dataloaders.rst (58 additions, 0 deletions)
Other, more exotic configurations:
* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended.

* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended.

CP/TP-safe batches with ``BroadcastingDataLoader``
---------------------------------------------------

Context-parallel (CP) and tensor-parallel (TP) training require all ranks
within the same ``(cp, tp)`` sub-mesh of a DP slot to process the **same**
global batch each step — CP shards the sequence dimension and TP shards
the feature dimension, so a divergent global batch breaks the per-rank
shape contract that CP/TP collectives assume.

Independent Lhotse loaders on each rank with ``shard_seed="randomized"``
guarantee that *seeded* shard cursors line up, but they don't protect
against background-thread non-determinism (``concurrent_bucketing``,
worker scheduling jitter, etc.). The empirical signature is per-rank
``cu_seqlens`` divergence at a fraction of training steps, which then
deadlocks NCCL collectives with mismatched shapes.
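
One way to confirm that signature before reaching for the wrapper is to
all-gather each rank's ``cu_seqlens`` and compare. A minimal debugging
sketch follows; the helper name is illustrative and not part of the NeMo
API:

.. code-block:: python

    import torch
    import torch.distributed as dist

    def assert_batch_in_sync(cu_seqlens: torch.Tensor, group=None) -> None:
        """Raise if any rank's cu_seqlens differs from rank 0's."""
        gathered = [None] * dist.get_world_size(group)
        # all_gather_object serializes arbitrary Python objects, so it
        # works even when the tensors already disagree in shape across ranks.
        dist.all_gather_object(gathered, cu_seqlens.tolist(), group=group)
        if any(g != gathered[0] for g in gathered):
            raise RuntimeError(f"per-rank cu_seqlens divergence: {gathered}")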

The :class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`
fixes this at the data layer: construct the real Lhotse loader on a
single DP-source rank (``cp_rank == 0`` and ``tp_rank == 0``) and let the
wrapper broadcast each batch to the other ranks in the ``(cp, tp)``
sub-mesh over NCCL. Iteration ends in lockstep via a continue/stop
broadcast — no length needs to be known up-front.

.. code-block:: python

from torch.distributed.device_mesh import init_device_mesh

from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config
from nemo.collections.common.data.lhotse.broadcasting import (
BroadcastingDataLoader,
is_dp_source_rank,
)

# dp, cp, tp: data/context/tensor parallel sizes (dp * cp * tp == world size)
mesh = init_device_mesh("cuda", (dp, cp, tp), mesh_dim_names=("dp", "cp", "tp"))

if is_dp_source_rank(mesh):
    # Only the cp_rank == 0, tp_rank == 0 rank builds the real loader.
    source = get_lhotse_dataloader_from_config(
        config=cfg.train_ds,
        global_rank=dp_rank,  # rank / world size along the "dp" axis only
        world_size=dp_size,
        dataset=dataset,
        tokenizer=tokenizer,
    )
else:
    source = None

loader = BroadcastingDataLoader(source=source, device_mesh=mesh)

The wrapper delegates ``state_dict`` / ``load_state_dict`` to the source
loader on the source rank (no-ops on non-source ranks), so checkpoint and
resume keep working transparently with regular ``DataLoader``,
``torchdata.StatefulDataLoader``, or any other source object that
implements those methods.
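
A minimal checkpoint/resume sketch, assuming ``loader`` is the wrapper from
the example above and ``rank`` is the process's global rank provided by the
launcher:

.. code-block:: python

    import torch

    # Save: every rank may call state_dict(); only the source rank's result
    # carries real loader state (non-source ranks no-op, per the text above).
    torch.save({"dataloader": loader.state_dict()}, f"loader_rank{rank}.pt")

    # Resume: hand the state back; non-source ranks no-op again.
    loader.load_state_dict(torch.load(f"loader_rank{rank}.pt")["dataloader"])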

The wrapper is a no-op when ``device_mesh`` is ``None`` or every named
axis present in the mesh has size 1, so the same call site works for
single-GPU, DDP-only, and CP/TP runs without a separate code path.
docs/source/speechlm2/configs.rst (26 additions, 0 deletions)
Defaults come from Automodel's ``BackendConfig`` and auto-select TransformerEngine
DeepEP when available; override here to pin a specific backend (for example,
``attn: sdpa`` to bypass TE).

**Packed sequences (THD):**

.. code-block:: yaml

model:
packed_sequences: true # default false (right-padded BSHD path)
automodel_backend:
attn: te # THD path dispatches TE varlen FlashAttention

When ``packed_sequences`` is true, ``SALMAutomodel.prepare_inputs`` packs
each minibatch into a single flat ``[T_total, H]`` sequence with a
``cu_seqlens`` index instead of right-padding to ``[B, T_max, H]``.
``SALMAutomodel`` then forwards the THD metadata (``qkv_format``,
``cu_seqlens``, ``position_ids``, ``max_seqlen``) through ``forward()`` to
the LLM. The TE attention preprocessor splits the single ``max_seqlen``
into the ``max_seqlen_q`` / ``max_seqlen_kv`` pair that
``DotProductAttention`` requires for ``qkv_format="thd"``. The packing also
rounds each utterance's flat length up to a multiple of ``2 * cp_size`` so
the same THD batch satisfies TE's CP DualChunkSwap contract — see the
"Context Parallelism (CP)" subsection in
:doc:`training_and_scaling` for the recommended pairing with ``cp_size > 1``.
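
A simplified sketch of the packing step is shown below. ``pack_thd`` is a
hypothetical helper for illustration only; the real logic lives in
``SALMAutomodel.prepare_inputs`` and additionally handles text/audio
interleaving and dtype/device placement:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def pack_thd(embeds: list, cp_size: int = 1):
        """Pack per-utterance [T_i, H] embeddings into one flat THD batch."""
        multiple = 2 * cp_size
        padded = []
        for emb in embeds:
            t = emb.shape[0]
            t_pad = ((t + multiple - 1) // multiple) * multiple  # round up
            padded.append(F.pad(emb, (0, 0, 0, t_pad - t)))      # pad time dim
        lens = torch.tensor([p.shape[0] for p in padded], dtype=torch.int32)
        cu_seqlens = F.pad(torch.cumsum(lens, dim=0), (1, 0)).to(torch.int32)
        packed = torch.cat(padded, dim=0)                        # [T_total, H]
        position_ids = torch.cat([torch.arange(n) for n in lens.tolist()])
        return packed, cu_seqlens, position_ids, int(lens.max())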

Padding overhead drops from ``O(B * (T_max - T_avg))`` wasted positions per
minibatch to at most ``2 * cp_size - 1`` padded positions per utterance.
The throughput improvement scales with the variance of utterance lengths in
your bucketing.

DuplexS2SModel Configuration
-----------------------------

docs/source/speechlm2/training_and_scaling.rst (87 additions, 2 deletions)
For distributed inference, launch with ``torchrun``:
inputs=path/to/manifest \
ep_size=2

Packed Sequences (THD)
""""""""""""""""""""""

``SALMAutomodel`` supports an opt-in packed-sequence (``THD``) training and
validation path that concatenates per-utterance text + audio embeddings into
a single flat ``[T_total, H]`` sequence with a ``cu_seqlens`` index, instead
of right-padding into the standard ``[B, T_max, H]`` (``BSHD``) layout. TE's
varlen FlashAttention then operates segment-by-segment without ever attending
across utterances, and Mamba's ``seq_idx`` is derived from the same
``cu_seqlens`` so SSM state resets at document boundaries.
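
For instance, the segment index Mamba consumes can be derived from
``cu_seqlens`` as in the hedged sketch below (illustrative, not the actual
implementation):

.. code-block:: python

    import torch

    # cu_seqlens marks utterance boundaries in the flat [T_total, H] batch.
    cu_seqlens = torch.tensor([0, 7, 19, 30], dtype=torch.int32)
    lens = cu_seqlens.diff()  # per-utterance lengths: [7, 12, 11]
    # One segment id per flat token: [0]*7 + [1]*12 + [2]*11
    seq_idx = torch.repeat_interleave(
        torch.arange(len(lens), dtype=torch.int32), lens.long()
    )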

For variable-length speech batches the padding overhead is substantial: the
``BSHD`` layout pays ``B * (T_max - T_avg)`` wasted compute per minibatch,
while ``THD`` pays only the per-utterance rounding to a multiple of
``2*cp_size`` (needed for TE's CP DualChunkSwap pattern). For example, a
batch of 16 utterances averaging 800 frames with a 2000-frame maximum wastes
``16 * 1200 = 19200`` padded positions under BSHD, versus at most
``2 * cp_size - 1`` per utterance under THD. Throughput improvement scales
with the variance of utterance lengths.

Enable it in the model config:

.. code-block:: yaml

model:
packed_sequences: true # opt-in; default false (BSHD)
automodel_backend:
attn: te # THD path requires TE attention

When ``packed_sequences`` is unset, the existing BSHD path is used unchanged.
Generation / inference always uses BSHD (it doesn't go through ``prepare_inputs``).

Context Parallelism (CP)
""""""""""""""""""""""""

``SALMAutomodel`` supports context parallelism for long-audio training on
hybrid Mamba/attention LLMs (e.g. Nemotron-V3). CP shards the sequence
dimension across GPUs so per-rank activations and KV-cache memory scale as
``T / cp_size`` instead of ``T``; attention layers go through TE's
DualChunkSwap pattern and Mamba mixers go through hidden-parallel
all-to-all (``MambaContextParallel`` in NeMo Automodel).

Enable via the strategy:

.. code-block:: yaml

trainer:
strategy:
_target_: nemo.collections.speechlm2.parts.parallel.AutomodelParallelStrategy
cp_size: 2 # context parallel size; must divide num_heads of every Mamba block
ep_size: 2 # may share the same ranks as CP

**The THD packed-sequence path is the only supported configuration under
CP.** Each utterance is its own attention segment and the per-utterance
sequence rounding aligns naturally with CP's ``2*cp_size`` requirement.

.. warning::
**BSHD + CP is not supported.** TE's fused-attention CP path supports
``causal`` but not ``padding_causal``, so the right-pad mask must be
dropped before the LLM. With the mask dropped, pad K/V leak into
real-token attention through the causal mask and the gradient through
the LoRA / projection parameters becomes ``NaN`` after the first
optimizer step (validated empirically: BSHD + CP=2 + EP=2 on a 2-GPU
run produces ``loss=4.62`` at step 1 then ``loss=nan`` from step 2
onwards). This is independent of the TE/cuDNN backward issue
documented below — setting ``NVTE_FUSED_ATTN=0`` does not fix it.
Set ``model.packed_sequences: true`` to use the THD path instead.

.. note::
**CP-safe data loading is automatic.** The speechlm2 datamodule wraps
the Lhotse loader in
:class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`,
so under CP/TP every batch is constructed once on the DP source rank
(``cp_rank == 0`` and ``tp_rank == 0``) and broadcast to its sub-mesh
peers. This eliminates per-rank Lhotse non-determinism (``concurrent_bucketing``,
worker scheduling jitter, etc.) as a source of NCCL deadlocks under CP.
See :doc:`/dataloaders` for the standalone API.

.. note::
**TE/THD exploding-gradients workaround on some GPUs.** On certain GPU
architectures (notably Blackwell ``sm_120``), the cuDNN backend that
TransformerEngine 2.14 picks for ``qkv_format="thd"`` with
``attn_mask_type="padding_causal"`` returns correct forward activations
but gradients amplified 8×–960× per layer. Compounded across the LLM's
attention stack, this drives gradients to magnitudes around ``1e22`` at
step 0, the gradient clip by norm computes ``1.0 / inf = 0``, and Adam's
moments eventually become NaN. Force TE to dispatch FlashAttention instead of cuDNN by
setting ``NVTE_FUSED_ATTN=0`` in the launcher environment (requires
``flash-attn`` to be installed for your GPU arch). The FlashAttention
THD/``padding_causal`` backward is gradient-correct on the same shapes.
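
If the launcher environment is awkward to modify, the same effect can be had
at the top of the training script. A sketch, with the assumption that the
variable is set before TransformerEngine (or anything importing it) loads:

.. code-block:: python

    import os

    # Must run before transformer_engine is imported; otherwise set
    # NVTE_FUSED_ATTN=0 in the launcher environment instead.
    os.environ["NVTE_FUSED_ATTN"] = "0"  # TE dispatches FlashAttention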

To configure parallelism, modify the ``trainer.strategy`` section in your YAML config:
