Resumable lhotse dataloader by pzelasko · Pull Request #15777 · NVIDIA-NeMo/NeMo

pzelasko · 2026-06-09T19:42:16Z

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

This is a pretty large PR using a new dataloader design in Lhotse. See the docs and tutorials in the related lhotse PR (below) to understand the design.

Unlike previous approaches it allows quick resumption, 100% determinism, and improved sampling randomness for sharded / tarred / lhotse shar data. I noticed it helps to improve the results for models trained on many datasets blended using mux - sometimes a lot, depending on how large is the data and how it's physically sharded.

I've been validating it for a while and I managed to train some successful models using it, but I might do some more hardening before merging.

From NeMo Speech user perspective it's not invasive: it will work with existing data and training configs, and requires only to build binary indexes for manifests, and adjust a few settings in the dataloader config.

There is a dedicated agent skill for migrating existing training recipe to new the dataloader.

Related Lhotse PR: lhotse-speech/lhotse#1570

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

…ataloader Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…al gap coverage Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…h; changes to allow byte-range reading from AIS Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

dataloader.py: _maybe_init_main_process_for_iterable eagerly calls worker_init_fn(0, ...) when num_workers=0 so the iterable path with no worker subprocesses still gets DP×worker partition wired up. Wired into both get_lhotse_dataloader_from_{single,multi}_config iterable branches. ASR + SpeechLM2 dataset wrappers: ais_prefer_individual -> ais_force_individual; tracks the lhotse-side rename. The USE_AIS_INDIVIDUAL_GETS env-var contract is unchanged. Skill docs updated for: the rename, the new byte-range shar_ptr fallback in AISBatchLoader, iterable-mode partition semantics, the 0-byte .idx race repro + atomic _write_index fix, and the prefetch-manifests / prefetch-indexes co-design (merged into the manifest prefetch script so .idx sidecars land at the resolve_idx_path slot the rewritten YAML will look up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…r; dedupe against lhotse primitives The bug: every indexed adapter's _iter_indexed iterated range(0, total_len) with no get_worker_partition() call, so under multi-rank training every DP rank x DataLoader worker yielded the same item sequence. Observed in the 0909-id sweep: 32 ranks all skipped the same 7 bad-path librilight cuts instead of 32x7=224 disjoint ones (full diagnosis in agent-debug-workspace/0909-en-only-id-4node/DIAGNOSIS.md). Adapter refactor (7 leaves now partition-aware): * LazyNeMoTarredIterator (nemo_adapters.py) * LazyParquetIterator (nemo_adapters.py) * LhotseTextJsonlAdapter, NeMoSFTJsonlAdapter (text_adapters.py) * NeMoMultimodalConversationJsonlAdapter (text_adapters.py) * NeMoMultimodalConversationShareGPTJsonlAdapter (text_adapters.py) * NeMoMultimodalConversationShareGPTWebdatasetAdapter (text_adapters.py) Each holds self._iter_state = PartitionedIndexedIterator() and replaces its local _position/_restored/range(start, n) loop with a one-line delegation: ``for global_idx in self._iter_state.iterate(self._total_len): ...``. state_dict/load_state_dict forward to the helper plus the adapter's own epoch counter. The 8th indexed adapter (LazyNeMoIterator) was already correct because it delegates to lhotse's LazyIndexedManifestIterator. Dedup against lhotse primitives (indexed_adapters.py, -170 LOC): * Drop local LazyShuffledRange and IndexedJSONLReader; use the lhotse-side classes (lhotse.indexing.LazyShuffledRange / IndexedJsonlReader). * Drop create_index; use lhotse.indexing.create_jsonl_index. * _load_index now delegates the offsets-loading to lhotse.indexing.read_index and layers only the NeMo-specific validation (file-size cross-check + legacy-format sentinel handling) on top. Kept: resolve_idx_path, IndexedTarSampleReader, IndexedTarMemberReader, create_tar_index (NeMo-style tar with basename-grouped members). Remove non-partition-aware index code paths (user directive): * Delete _iter_jsonl_indexed in NeMoMultimodalConversationShareGPTJsonlAdapter and _iter_indexed in NeMoMultimodalConversationShareGPTWebdatasetAdapter. Both used .idx files for shuffling without partitioning the result. The __iter__ dispatch now routes index-driven access exclusively through the partition-aware _iter_indexed_node (indexed=True). Non-indexed paths keep using shard-level DP partitioning via streaming. Remove the now- unused self._has_index plumbing. Tests: * New tests/collections/common/test_lhotse_indexed_partition.py: 28 tests (7 adapters x 4 world sizes {1,2,4,5}) asserting per-rank slices are pairwise disjoint and the union covers the manifest exactly once. * test_lhotse_multimodal_dataloading.py: delete 6 tests that asserted the removed non-partition-aware fallback paths, plus the 3 NeMoLazyShuffledRange tests now covered by lhotse-side test_indexing.py. Drop _has_index assertions from the surviving tests. 58/58 NeMo tests + 179/179 lhotse tests pass under nemo312-hf5 with PYTHONPATH=lhotse_resumable:NeMo_resumable. Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

End-to-end harness that exercises any train_ds-shaped Lhotse config: torchrun-launched per-rank entry dumps per-batch cut.id JSONL, saves state_dict at a configurable step, and a separate resumed-phase torchrun loads it and continues. Post-iteration consolidate.py verifies five properties: Q1 no cut.id appears on >1 (rank, worker) Q2 union of yielded IDs equals ground-truth enumeration Q3 per-rank cut sets are pairwise disjoint Q4 resumed cells match baseline tail bit-for-bit (off-by-one aware: state captures position AFTER yielding step K, so resumed[0] == baseline[K+1]) Q5 two independent runs with the same seed yield identical (rank, step) cut sets Plus 16 static pre-validation checks (seed types, indexed/stateful flags, idx-file presence, constant-time-leaves contract, mux weight sanity, multi_config flag propagation, bucketer buffer heuristic, text_field sanity). Caught the LazyIndexedSharIterator partition leak fixed in the companion lhotse commit (4072 AMI cuts duplicated across 8 DP ranks under iterable+indexed mode); now reports clean PASS on the 0909-en-only-id2 recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…ss timer pre-emption patch (willfix) Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…hotse-dataloader # Conflicts: # CLAUDE.md # docs/source/dataloaders.rst # nemo/collections/common/data/lhotse/nemo_adapters.py # nemo/collections/speechlm2/data/datamodule.py # nemo/collections/speechlm2/models/salm_automodel.py Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

copy-pr-bot · 2026-06-09T19:42:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kobenaxie · 2026-06-10T06:33:07Z

How to use lhotse dataloader to load Webdataset which contains wav and txt in tar file.

pzelasko · 2026-06-10T13:50:18Z

How to use lhotse dataloader to load Webdataset which contains wav and txt in tar file.

@kobenaxie To handle this you'll need to write a similar class to LazyNeMoTarredIterator to map your specific WDS format to a Cut and add a @data_type_parser to register that under some format name; you can also check the dataloading documentation in this PR which describes these things a little better than was available before.

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

…hotse-dataloader # Conflicts: # nemo/collections/speechlm2/models/salm_automodel.py # nemo/collections/speechlm2/parts/cp_helpers.py # nemo/collections/speechlm2/parts/encoder_chunking.py # tests/collections/speechlm2/test_salm_cp_helpers.py Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko added 17 commits May 1, 2026 16:20

First draft of indexed lhotse datasets integration + checkpointable d…

905bf99

…ataloader Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Support new Lhotse's indexed iterators across NeMo

48818f5

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

refactor read_batch

8a482e4

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

refactor/cleanup

086f0e3

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Merge branch 'main' into stateful-restorable-lhotse-dataloader

69ad897

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Documentation update to reflect indexed/checkpointable things + gener…

ad6861a

…al gap coverage Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Add supported for external index directory

ffc53c7

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

total token/examples logging; force individual GET instead of GetBatc…

d7b6855

…h; changes to allow byte-range reading from AIS Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Agentic skill for performing migration to the new dataloader

d057dee

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Proper lhotse shar indexing support and refactoring

ca40489

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

dataloder checkpoints save/load correct per-rank information; statele…

3e8dab8

…ss timer pre-emption patch (willfix) Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Fix dataloader_iter resumability on preemption

563a81e

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Support text-only data loading

993c246

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

github-actions Bot added core Changes to NeMo Core ASR common audio labels Jun 9, 2026

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Fix ShareGPT multimodal resumable dataloading

0f4a125

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

github-advanced-security AI found potential problems Jun 11, 2026

View reviewed changes

Comment thread nemo/collections/common/data/lhotse/text_adapters.py Fixed

pzelasko added 3 commits June 12, 2026 08:55

Fix CodeQL review feedback

2ba49f9

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Fix common test failures

64fae40

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Fix CI checks

9f47392

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko force-pushed the stateful-restorable-lhotse-dataloader branch from dcdcb09 to 9f47392 Compare June 12, 2026 18:18

Fix remaining callback lint

c61b126

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko added the Run CICD label Jun 12, 2026

github-advanced-security AI found potential problems Jun 12, 2026

View reviewed changes

Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed

Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed

Comment thread nemo/collections/common/data/lhotse/_compat.py Fixed

pzelasko added 2 commits June 12, 2026 11:42

Address CodeQL compatibility comments

cb3d81d

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Document Lhotse compatibility shims

ae0a131

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resumable lhotse dataloader#15777

Resumable lhotse dataloader#15777
pzelasko wants to merge 25 commits into
mainfrom
stateful-restorable-lhotse-dataloader

pzelasko commented Jun 9, 2026

Uh oh!

copy-pr-bot Bot commented Jun 9, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kobenaxie commented Jun 10, 2026

Uh oh!

pzelasko commented Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants