docs/source/dataloaders.rst (58 additions, 0 deletions)
Other, more exotic configurations:
* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended.

* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended.

CP/TP-safe batches with ``BroadcastingDataLoader``
---------------------------------------------------

Context-parallel (CP) and tensor-parallel (TP) training require all ranks
within the same ``(cp, tp)`` sub-mesh of a DP slot to process the **same**
global batch each step — CP shards the sequence dimension and TP shards
the feature dimension, so a divergent global batch breaks the per-rank
shape contract that CP/TP collectives assume.

Independent Lhotse loaders on each rank with ``shard_seed="randomized"``
guarantee that *seeded* shard cursors line up, but they don't protect
against background-thread non-determinism (``concurrent_bucketing``,
worker scheduling jitter, etc.). The empirical signature is per-rank
``cu_seqlens`` divergence at a fraction of training steps, which then
deadlocks NCCL collectives with mismatched shapes.
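
One way to confirm that signature before reaching for the wrapper is to
all-gather each rank's ``cu_seqlens`` and compare. A minimal debugging
sketch follows; the helper name is illustrative and not part of the NeMo
API:

.. code-block:: python

    import torch
    import torch.distributed as dist

    def assert_batch_in_sync(cu_seqlens: torch.Tensor, group=None) -> None:
        """Raise if any rank's cu_seqlens differs from rank 0's."""
        gathered = [None] * dist.get_world_size(group)
        # all_gather_object serializes arbitrary Python objects, so it
        # works even when the tensors already disagree in shape across ranks.
        dist.all_gather_object(gathered, cu_seqlens.tolist(), group=group)
        if any(g != gathered[0] for g in gathered):
            raise RuntimeError(f"per-rank cu_seqlens divergence: {gathered}")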

The :class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`
fixes this at the data layer: construct the real Lhotse loader on a
single DP-source rank (``cp_rank == 0`` and ``tp_rank == 0``) and let the
wrapper broadcast each batch to the other ranks in the ``(cp, tp)``
sub-mesh over NCCL. Iteration ends in lockstep via a continue/stop
broadcast — no length needs to be known up-front.

.. code-block:: python

from torch.distributed.device_mesh import init_device_mesh

from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config
from nemo.collections.common.data.lhotse.broadcasting import (
BroadcastingDataLoader,
is_dp_source_rank,
)

# dp, cp, tp: data/context/tensor parallel sizes (dp * cp * tp == world size)
mesh = init_device_mesh("cuda", (dp, cp, tp), mesh_dim_names=("dp", "cp", "tp"))

if is_dp_source_rank(mesh):
    # Only the cp_rank == 0, tp_rank == 0 rank builds the real loader.
    source = get_lhotse_dataloader_from_config(
        config=cfg.train_ds,
        global_rank=dp_rank,  # rank / world size along the "dp" axis only
        world_size=dp_size,
        dataset=dataset,
        tokenizer=tokenizer,
    )
else:
    source = None

loader = BroadcastingDataLoader(source=source, device_mesh=mesh)

The wrapper delegates ``state_dict`` / ``load_state_dict`` to the source
loader on the source rank (no-ops on non-source ranks), so checkpoint and
resume keep working transparently with regular ``DataLoader``,
``torchdata.StatefulDataLoader``, or any other source object that
implements those methods.
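
A minimal checkpoint/resume sketch, assuming ``loader`` is the wrapper from
the example above and ``rank`` is the process's global rank provided by the
launcher:

.. code-block:: python

    import torch

    # Save: every rank may call state_dict(); only the source rank's result
    # carries real loader state (non-source ranks no-op, per the text above).
    torch.save({"dataloader": loader.state_dict()}, f"loader_rank{rank}.pt")

    # Resume: hand the state back; non-source ranks no-op again.
    loader.load_state_dict(torch.load(f"loader_rank{rank}.pt")["dataloader"])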

The wrapper is a no-op when ``device_mesh`` is ``None`` or every named
axis present in the mesh has size 1, so the same call site works for
single-GPU, DDP-only, and CP/TP runs without a separate code path.
docs/source/speechlm2/configs.rst (26 additions, 0 deletions)
Defaults come from Automodel's ``BackendConfig`` and auto-select TransformerEngine
DeepEP when available; override here to pin a specific backend (for example,
``attn: sdpa`` to bypass TE).

**Packed sequences (THD):**

.. code-block:: yaml

model:
packed_sequences: true # default false (right-padded BSHD path)
automodel_backend:
attn: te # THD path dispatches TE varlen FlashAttention

When ``packed_sequences`` is true, ``SALMAutomodel.prepare_inputs`` packs
each minibatch into a single flat ``[T_total, H]`` sequence with a
``cu_seqlens`` index instead of right-padding to ``[B, T_max, H]``.
``SALMAutomodel`` then forwards the THD metadata (``qkv_format``,
``cu_seqlens``, ``position_ids``, ``max_seqlen``) through ``forward()`` to
the LLM. The TE attention preprocessor splits the single ``max_seqlen``
into the ``max_seqlen_q`` / ``max_seqlen_kv`` pair that
``DotProductAttention`` requires for ``qkv_format="thd"``. The packing also
rounds each utterance's flat length up to a multiple of ``2 * cp_size`` so
the same THD batch satisfies TE's CP DualChunkSwap contract — see the
"Context Parallelism (CP)" subsection in
:doc:`training_and_scaling` for the recommended pairing with ``cp_size > 1``.
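
A simplified sketch of the packing step is shown below. ``pack_thd`` is a
hypothetical helper for illustration only; the real logic lives in
``SALMAutomodel.prepare_inputs`` and additionally handles text/audio
interleaving and dtype/device placement:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def pack_thd(embeds: list, cp_size: int = 1):
        """Pack per-utterance [T_i, H] embeddings into one flat THD batch."""
        multiple = 2 * cp_size
        padded = []
        for emb in embeds:
            t = emb.shape[0]
            t_pad = ((t + multiple - 1) // multiple) * multiple  # round up
            padded.append(F.pad(emb, (0, 0, 0, t_pad - t)))      # pad time dim
        lens = torch.tensor([p.shape[0] for p in padded], dtype=torch.int32)
        cu_seqlens = F.pad(torch.cumsum(lens, dim=0), (1, 0)).to(torch.int32)
        packed = torch.cat(padded, dim=0)                        # [T_total, H]
        position_ids = torch.cat([torch.arange(n) for n in lens.tolist()])
        return packed, cu_seqlens, position_ids, int(lens.max())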

Padding overhead drops from ``O(B * (T_max - T_avg))`` wasted positions per
minibatch to at most ``2 * cp_size - 1`` padded positions per utterance.
The throughput improvement scales with the variance of utterance lengths in
your bucketing.

DuplexS2SModel Configuration
-----------------------------

docs/source/speechlm2/training_and_scaling.rst (87 additions, 2 deletions)
For distributed inference, launch with ``torchrun``:
inputs=path/to/manifest \
ep_size=2

Packed Sequences (THD)
""""""""""""""""""""""

``SALMAutomodel`` supports an opt-in packed-sequence (``THD``) training and
validation path that concatenates per-utterance text + audio embeddings into
a single flat ``[T_total, H]`` sequence with a ``cu_seqlens`` index, instead
of right-padding into the standard ``[B, T_max, H]`` (``BSHD``) layout. TE's
varlen FlashAttention then operates segment-by-segment without ever attending
across utterances, and Mamba's ``seq_idx`` is derived from the same
``cu_seqlens`` so SSM state resets at document boundaries.
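
For instance, the segment index Mamba consumes can be derived from
``cu_seqlens`` as in the hedged sketch below (illustrative, not the actual
implementation):

.. code-block:: python

    import torch

    # cu_seqlens marks utterance boundaries in the flat [T_total, H] batch.
    cu_seqlens = torch.tensor([0, 7, 19, 30], dtype=torch.int32)
    lens = cu_seqlens.diff()  # per-utterance lengths: [7, 12, 11]
    # One segment id per flat token: [0]*7 + [1]*12 + [2]*11
    seq_idx = torch.repeat_interleave(
        torch.arange(len(lens), dtype=torch.int32), lens.long()
    )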

For variable-length speech batches the padding overhead is substantial: the
``BSHD`` layout pays ``B * (T_max - T_avg)`` wasted compute per minibatch,
while ``THD`` pays only the per-utterance rounding to a multiple of
``2*cp_size`` (needed for TE's CP DualChunkSwap pattern). For example, a
batch of 16 utterances averaging 800 frames with a 2000-frame maximum wastes
``16 * 1200 = 19200`` padded positions under BSHD, versus at most
``2 * cp_size - 1`` per utterance under THD. Throughput improvement scales
with the variance of utterance lengths.

Enable it in the model config:

.. code-block:: yaml

model:
packed_sequences: true # opt-in; default false (BSHD)
automodel_backend:
attn: te # THD path requires TE attention

When ``packed_sequences`` is unset, the existing BSHD path is used unchanged.
Generation / inference always uses BSHD (it doesn't go through ``prepare_inputs``).

Context Parallelism (CP)
""""""""""""""""""""""""

``SALMAutomodel`` supports context parallelism for long-audio training on
hybrid Mamba/attention LLMs (e.g. Nemotron-V3). CP shards the sequence
dimension across GPUs so per-rank activations and KV-cache memory scale as
``T / cp_size`` instead of ``T``; attention layers go through TE's
DualChunkSwap pattern and Mamba mixers go through hidden-parallel
all-to-all (``MambaContextParallel`` in NeMo Automodel).

Enable via the strategy:

.. code-block:: yaml

trainer:
strategy:
_target_: nemo.collections.speechlm2.parts.parallel.AutomodelParallelStrategy
cp_size: 2 # context parallel size; must divide num_heads of every Mamba block
ep_size: 2 # may share the same ranks as CP

**The THD packed-sequence path is the only supported configuration under
CP.** Each utterance is its own attention segment and the per-utterance
sequence rounding aligns naturally with CP's ``2*cp_size`` requirement.

.. warning::
**BSHD + CP is not supported.** TE's fused-attention CP path supports
``causal`` but not ``padding_causal``, so the right-pad mask must be
dropped before the LLM. With the mask dropped, pad K/V leak into
real-token attention through the causal mask and the gradient through
the LoRA / projection parameters becomes ``NaN`` after the first
optimizer step (validated empirically: BSHD + CP=2 + EP=2 on a 2-GPU
run produces ``loss=4.62`` at step 1 then ``loss=nan`` from step 2
onwards). This is independent of the TE/cuDNN backward issue
documented below — setting ``NVTE_FUSED_ATTN=0`` does not fix it.
Set ``model.packed_sequences: true`` to use the THD path instead.

.. note::
**CP-safe data loading is automatic.** The speechlm2 datamodule wraps
the Lhotse loader in
:class:`~nemo.collections.common.data.lhotse.broadcasting.BroadcastingDataLoader`,
so under CP/TP every batch is constructed once on the DP source rank
(``cp_rank == 0`` and ``tp_rank == 0``) and broadcast to its sub-mesh
peers. This eliminates per-rank Lhotse non-determinism (``concurrent_bucketing``,
worker scheduling jitter, etc.) as a source of NCCL deadlocks under CP.
See :doc:`/dataloaders` for the standalone API.

.. note::
**TE/THD exploding-gradients workaround on some GPUs.** On certain GPU
architectures (notably Blackwell ``sm_120``), the cuDNN backend that
TransformerEngine 2.14 picks for ``qkv_format="thd"`` with
``attn_mask_type="padding_causal"`` returns correct forward activations
but gradients amplified 8×–960× per layer. Compounded across the LLM's
attention stack, this drives gradients to magnitudes around ``1e22`` at
step 0, the gradient clip by norm computes ``1.0 / inf = 0``, and Adam's
moments eventually become NaN. Force TE to dispatch FlashAttention instead of cuDNN by
setting ``NVTE_FUSED_ATTN=0`` in the launcher environment (requires
``flash-attn`` to be installed for your GPU arch). The FlashAttention
THD/``padding_causal`` backward is gradient-correct on the same shapes.
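
If the launcher environment is awkward to modify, the same effect can be had
at the top of the training script. A sketch, with the assumption that the
variable is set before TransformerEngine (or anything importing it) loads:

.. code-block:: python

    import os

    # Must run before transformer_engine is imported; otherwise set
    # NVTE_FUSED_ATTN=0 in the launcher environment instead.
    os.environ["NVTE_FUSED_ATTN"] = "0"  # TE dispatches FlashAttention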

To configure parallelism, modify the ``trainer.strategy`` section in your YAML config:
