Resumable lhotse dataloader#15777
Draft
pzelasko wants to merge 25 commits into
Draft
Conversation
…ataloader Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…al gap coverage Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…h; changes to allow byte-range reading from AIS Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
dataloader.py: _maybe_init_main_process_for_iterable eagerly calls
worker_init_fn(0, ...) when num_workers=0 so the iterable path with
no worker subprocesses still gets DP×worker partition wired up.
Wired into both get_lhotse_dataloader_from_{single,multi}_config
iterable branches.
ASR + SpeechLM2 dataset wrappers: ais_prefer_individual ->
ais_force_individual; tracks the lhotse-side rename. The
USE_AIS_INDIVIDUAL_GETS env-var contract is unchanged.
Skill docs updated for: the rename, the new byte-range shar_ptr
fallback in AISBatchLoader, iterable-mode partition semantics, the
0-byte .idx race repro + atomic _write_index fix, and the
prefetch-manifests / prefetch-indexes co-design (merged into the
manifest prefetch script so .idx sidecars land at the resolve_idx_path
slot the rewritten YAML will look up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…r; dedupe against lhotse primitives
The bug: every indexed adapter's _iter_indexed iterated range(0, total_len)
with no get_worker_partition() call, so under multi-rank training every DP
rank x DataLoader worker yielded the same item sequence. Observed in the
0909-id sweep: 32 ranks all skipped the same 7 bad-path librilight cuts
instead of 32x7=224 disjoint ones (full diagnosis in
agent-debug-workspace/0909-en-only-id-4node/DIAGNOSIS.md).
Adapter refactor (7 leaves now partition-aware):
* LazyNeMoTarredIterator (nemo_adapters.py)
* LazyParquetIterator (nemo_adapters.py)
* LhotseTextJsonlAdapter, NeMoSFTJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationShareGPTJsonlAdapter (text_adapters.py)
* NeMoMultimodalConversationShareGPTWebdatasetAdapter (text_adapters.py)
Each holds self._iter_state = PartitionedIndexedIterator() and replaces its
local _position/_restored/range(start, n) loop with a one-line delegation:
``for global_idx in self._iter_state.iterate(self._total_len): ...``.
state_dict/load_state_dict forward to the helper plus the adapter's own
epoch counter. The 8th indexed adapter (LazyNeMoIterator) was already
correct because it delegates to lhotse's LazyIndexedManifestIterator.
Dedup against lhotse primitives (indexed_adapters.py, -170 LOC):
* Drop local LazyShuffledRange and IndexedJSONLReader; use the lhotse-side
classes (lhotse.indexing.LazyShuffledRange / IndexedJsonlReader).
* Drop create_index; use lhotse.indexing.create_jsonl_index.
* _load_index now delegates the offsets-loading to lhotse.indexing.read_index
and layers only the NeMo-specific validation (file-size cross-check +
legacy-format sentinel handling) on top.
Kept: resolve_idx_path, IndexedTarSampleReader, IndexedTarMemberReader,
create_tar_index (NeMo-style tar with basename-grouped members).
Remove non-partition-aware index code paths (user directive):
* Delete _iter_jsonl_indexed in NeMoMultimodalConversationShareGPTJsonlAdapter
and _iter_indexed in NeMoMultimodalConversationShareGPTWebdatasetAdapter.
Both used .idx files for shuffling without partitioning the result. The
__iter__ dispatch now routes index-driven access exclusively through the
partition-aware _iter_indexed_node (indexed=True). Non-indexed paths
keep using shard-level DP partitioning via streaming. Remove the now-
unused self._has_index plumbing.
Tests:
* New tests/collections/common/test_lhotse_indexed_partition.py: 28 tests
(7 adapters x 4 world sizes {1,2,4,5}) asserting per-rank slices are
pairwise disjoint and the union covers the manifest exactly once.
* test_lhotse_multimodal_dataloading.py: delete 6 tests that asserted the
removed non-partition-aware fallback paths, plus the 3 NeMoLazyShuffledRange
tests now covered by lhotse-side test_indexing.py. Drop _has_index
assertions from the surviving tests.
58/58 NeMo tests + 179/179 lhotse tests pass under nemo312-hf5 with
PYTHONPATH=lhotse_resumable:NeMo_resumable.
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
End-to-end harness that exercises any train_ds-shaped Lhotse config:
torchrun-launched per-rank entry dumps per-batch cut.id JSONL, saves
state_dict at a configurable step, and a separate resumed-phase
torchrun loads it and continues. Post-iteration consolidate.py
verifies five properties:
Q1 no cut.id appears on >1 (rank, worker)
Q2 union of yielded IDs equals ground-truth enumeration
Q3 per-rank cut sets are pairwise disjoint
Q4 resumed cells match baseline tail bit-for-bit (off-by-one
aware: state captures position AFTER yielding step K, so
resumed[0] == baseline[K+1])
Q5 two independent runs with the same seed yield identical
(rank, step) cut sets
Plus 16 static pre-validation checks (seed types, indexed/stateful
flags, idx-file presence, constant-time-leaves contract, mux weight
sanity, multi_config flag propagation, bucketer buffer heuristic,
text_field sanity).
Caught the LazyIndexedSharIterator partition leak fixed in the
companion lhotse commit (4072 AMI cuts duplicated across 8 DP
ranks under iterable+indexed mode); now reports clean PASS on the
0909-en-only-id2 recipe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…ss timer pre-emption patch (willfix) Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…hotse-dataloader # Conflicts: # CLAUDE.md # docs/source/dataloaders.rst # nemo/collections/common/data/lhotse/nemo_adapters.py # nemo/collections/speechlm2/data/datamodule.py # nemo/collections/speechlm2/models/salm_automodel.py Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
|
How to use lhotse dataloader to load Webdataset which contains |
Collaborator
Author
@kobenaxie To handle this you'll need to write a similar class to LazyNeMoTarredIterator to map your specific WDS format to a Cut and add a @data_type_parser to register that under some format name; you can also check the dataloading documentation in this PR which describes these things a little better than was available before. |
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
…hotse-dataloader # Conflicts: # nemo/collections/speechlm2/models/salm_automodel.py # nemo/collections/speechlm2/parts/cp_helpers.py # nemo/collections/speechlm2/parts/encoder_chunking.py # tests/collections/speechlm2/test_salm_cp_helpers.py Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
dcdcb09 to
9f47392
Compare
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important
The
Update branchbutton must only be pressed in very rare occassions.An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
This is a pretty large PR using a new dataloader design in Lhotse. See the docs and tutorials in the related lhotse PR (below) to understand the design.
Unlike previous approaches it allows quick resumption, 100% determinism, and improved sampling randomness for sharded / tarred / lhotse shar data. I noticed it helps to improve the results for models trained on many datasets blended using mux - sometimes a lot, depending on how large is the data and how it's physically sharded.
I've been validating it for a while and I managed to train some successful models using it, but I might do some more hardening before merging.
From NeMo Speech user perspective it's not invasive: it will work with existing data and training configs, and requires only to build binary indexes for manifests, and adjust a few settings in the dataloader config.
There is a dedicated agent skill for migrating existing training recipe to new the dataloader.
Related Lhotse PR: lhotse-speech/lhotse#1570
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use thisGitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information