Skip to content

Resumable lhotse dataloader#15777

Draft
pzelasko wants to merge 25 commits into
mainfrom
stateful-restorable-lhotse-dataloader
Draft

Resumable lhotse dataloader#15777
pzelasko wants to merge 25 commits into
mainfrom
stateful-restorable-lhotse-dataloader

Conversation

@pzelasko

@pzelasko pzelasko commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

This is a pretty large PR using a new dataloader design in Lhotse. See the docs and tutorials in the related lhotse PR (below) to understand the design.

Unlike previous approaches it allows quick resumption, 100% determinism, and improved sampling randomness for sharded / tarred / lhotse shar data. I noticed it helps to improve the results for models trained on many datasets blended using mux - sometimes a lot, depending on how large is the data and how it's physically sharded.

I've been validating it for a while and I managed to train some successful models using it, but I might do some more hardening before merging.

From NeMo Speech user perspective it's not invasive: it will work with existing data and training configs, and requires only to build binary indexes for manifests, and adjust a few settings in the dataloader config.

There is a dedicated agent skill for migrating existing training recipe to new the dataloader.

Related Lhotse PR: lhotse-speech/lhotse#1570

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

pzelasko added 17 commits May 1, 2026 16:20
…ataloader

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…al gap coverage

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…h; changes to allow byte-range reading from AIS

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
dataloader.py: _maybe_init_main_process_for_iterable eagerly calls
worker_init_fn(0, ...) when num_workers=0 so the iterable path with
no worker subprocesses still gets DP×worker partition wired up.
Wired into both get_lhotse_dataloader_from_{single,multi}_config
iterable branches.

ASR + SpeechLM2 dataset wrappers: ais_prefer_individual ->
ais_force_individual; tracks the lhotse-side rename. The
USE_AIS_INDIVIDUAL_GETS env-var contract is unchanged.

Skill docs updated for: the rename, the new byte-range shar_ptr
fallback in AISBatchLoader, iterable-mode partition semantics, the
0-byte .idx race repro + atomic _write_index fix, and the
prefetch-manifests / prefetch-indexes co-design (merged into the
manifest prefetch script so .idx sidecars land at the resolve_idx_path
slot the rewritten YAML will look up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…r; dedupe against lhotse primitives

The bug: every indexed adapter's _iter_indexed iterated range(0, total_len)
with no get_worker_partition() call, so under multi-rank training every DP
rank x DataLoader worker yielded the same item sequence. Observed in the
0909-id sweep: 32 ranks all skipped the same 7 bad-path librilight cuts
instead of 32x7=224 disjoint ones (full diagnosis in
agent-debug-workspace/0909-en-only-id-4node/DIAGNOSIS.md).

Adapter refactor (7 leaves now partition-aware):
* LazyNeMoTarredIterator (nemo_adapters.py)
* LazyParquetIterator (nemo_adapters.py)
* LhotseTextJsonlAdapter, NeMoSFTJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationShareGPTJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationShareGPTWebdatasetAdapter (text_adapters.py)
Each holds self._iter_state = PartitionedIndexedIterator() and replaces its
local _position/_restored/range(start, n) loop with a one-line delegation:
``for global_idx in self._iter_state.iterate(self._total_len): ...``.
state_dict/load_state_dict forward to the helper plus the adapter's own
epoch counter. The 8th indexed adapter (LazyNeMoIterator) was already
correct because it delegates to lhotse's LazyIndexedManifestIterator.

Dedup against lhotse primitives (indexed_adapters.py, -170 LOC):
* Drop local LazyShuffledRange and IndexedJSONLReader; use the lhotse-side
  classes (lhotse.indexing.LazyShuffledRange / IndexedJsonlReader).
* Drop create_index; use lhotse.indexing.create_jsonl_index.
* _load_index now delegates the offsets-loading to lhotse.indexing.read_index
  and layers only the NeMo-specific validation (file-size cross-check +
  legacy-format sentinel handling) on top.
Kept: resolve_idx_path, IndexedTarSampleReader, IndexedTarMemberReader,
create_tar_index (NeMo-style tar with basename-grouped members).

Remove non-partition-aware index code paths (user directive):
* Delete _iter_jsonl_indexed in NeMoMultimodalConversationShareGPTJsonlAdapter
  and _iter_indexed in NeMoMultimodalConversationShareGPTWebdatasetAdapter.
  Both used .idx files for shuffling without partitioning the result. The
  __iter__ dispatch now routes index-driven access exclusively through the
  partition-aware _iter_indexed_node (indexed=True). Non-indexed paths
  keep using shard-level DP partitioning via streaming. Remove the now-
  unused self._has_index plumbing.

Tests:
* New tests/collections/common/test_lhotse_indexed_partition.py: 28 tests
  (7 adapters x 4 world sizes {1,2,4,5}) asserting per-rank slices are
  pairwise disjoint and the union covers the manifest exactly once.
* test_lhotse_multimodal_dataloading.py: delete 6 tests that asserted the
  removed non-partition-aware fallback paths, plus the 3 NeMoLazyShuffledRange
  tests now covered by lhotse-side test_indexing.py. Drop _has_index
  assertions from the surviving tests.

58/58 NeMo tests + 179/179 lhotse tests pass under nemo312-hf5 with
PYTHONPATH=lhotse_resumable:NeMo_resumable.

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
End-to-end harness that exercises any train_ds-shaped Lhotse config:
torchrun-launched per-rank entry dumps per-batch cut.id JSONL, saves
state_dict at a configurable step, and a separate resumed-phase
torchrun loads it and continues. Post-iteration consolidate.py
verifies five properties:

  Q1 no cut.id appears on >1 (rank, worker)
  Q2 union of yielded IDs equals ground-truth enumeration
  Q3 per-rank cut sets are pairwise disjoint
  Q4 resumed cells match baseline tail bit-for-bit (off-by-one
     aware: state captures position AFTER yielding step K, so
     resumed[0] == baseline[K+1])
  Q5 two independent runs with the same seed yield identical
     (rank, step) cut sets

Plus 16 static pre-validation checks (seed types, indexed/stateful
flags, idx-file presence, constant-time-leaves contract, mux weight
sanity, multi_config flag propagation, bucketer buffer heuristic,
text_field sanity).

Caught the LazyIndexedSharIterator partition leak fixed in the
companion lhotse commit (4072 AMI cuts duplicated across 8 DP
ranks under iterable+indexed mode); now reports clean PASS on the
0909-en-only-id2 recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…ss timer pre-emption patch (willfix)

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…hotse-dataloader

# Conflicts:
#	CLAUDE.md
#	docs/source/dataloaders.rst
#	nemo/collections/common/data/lhotse/nemo_adapters.py
#	nemo/collections/speechlm2/data/datamodule.py
#	nemo/collections/speechlm2/models/salm_automodel.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added core Changes to NeMo Core ASR common audio labels Jun 9, 2026
Comment thread nemo/collections/common/data/lhotse/nemo_adapters.py Fixed
Comment thread nemo/collections/common/data/lhotse/nemo_adapters.py Fixed
Comment thread nemo/collections/common/data/lhotse/nemo_adapters.py Fixed
Comment thread nemo/collections/common/data/lhotse/text_adapters.py Fixed
Comment thread nemo/collections/common/data/lhotse/text_adapters.py Fixed
Comment thread scripts/dataloading/prefetch_indexes.py Fixed
Comment thread scripts/dataloading/prefetch_indexes.py Fixed
Comment thread tests/collections/common/test_lhotse_indexed_partition.py Fixed
Comment thread tests/collections/common/test_lhotse_indexed_partition.py Fixed
Comment thread tests/collections/common/test_lhotse_indexed_partition.py Fixed
@kobenaxie

Copy link
Copy Markdown

How to use lhotse dataloader to load Webdataset which contains wav and txt in tar file.

@pzelasko

Copy link
Copy Markdown
Collaborator Author

How to use lhotse dataloader to load Webdataset which contains wav and txt in tar file.

@kobenaxie To handle this you'll need to write a similar class to LazyNeMoTarredIterator to map your specific WDS format to a Cut and add a @data_type_parser to register that under some format name; you can also check the dataloading documentation in this PR which describes these things a little better than was available before.

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Comment thread nemo/collections/common/data/lhotse/text_adapters.py Fixed
pzelasko added 3 commits June 12, 2026 08:55
…hotse-dataloader

# Conflicts:
#	nemo/collections/speechlm2/models/salm_automodel.py
#	nemo/collections/speechlm2/parts/cp_helpers.py
#	nemo/collections/speechlm2/parts/encoder_chunking.py
#	tests/collections/speechlm2/test_salm_cp_helpers.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
@pzelasko pzelasko force-pushed the stateful-restorable-lhotse-dataloader branch from dcdcb09 to 9f47392 Compare June 12, 2026 18:18
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed
Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed
Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed
pzelasko added 2 commits June 12, 2026 11:42
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants