Sequence packing on large chat data #14929
Open
nicolo-domyn wants to merge 9 commits into NVIDIA-NeMo:main from nicolo-domyn:feat/chat-sequence-packing
+313 −108
Conversation
nicolo-domyn force-pushed the feat/chat-sequence-packing branch from 598cb96 to f51f406
- This allows sequences to be treated as already padded (although they are not), so that when we pad in downstream tasks (which should be done inside `GPTSFTPackedDataset.collate_fn`) the packed sequence doesn't exceed the maximum length due to unaccounted padding.
- …subsequences in a packed sequence to be of a length divisible by `self.pad_seq_length_to_mult`. The functions `pad_thd_sequences_for_cp` and `generate_positional_ids_for_cp` are copied from transformer_engine, since they are not available in the official NeMo container venvs.
- … packing and `return_cu_seqlen`.
- …ult.
nicolo-domyn force-pushed the feat/chat-sequence-packing branch from f51f406 to 45bfc54
- …el.py: Remove old file.
Important
The "Update branch" button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR. Please reach out to the automation team before pressing that button.
What does this PR do?
Implement sequence packing for chat-like data and increase its efficiency for large-scale datasets.
Collection: llm, nlp, utils
Changelog
- `llm.gpt.chat.data.ChatDataModule`: methods `_create_data` and `prepare_data`.
- `llm.gpt.data.core.GPTSFTChatDataset`: the `collate_fn` seemed to assume that the sequences in the batch came pre-padded when context parallelism is active (see this comment). However, when running the data pipeline this did not appear to be the case, and sequences came in unpadded. Consequently, the proposed change treats the input sequences as unpadded and runs padding during `collate_fn`.
- `utils.sequence_packing_utils.create_hist`: for the same reason as above, the sequences coming into the function are not padded. If padding is done after packing, the packed sequence may exceed its maximum allowed length (due to the unaccounted padding). Introduce a `divisibility_factor` argument inside `create_hist` to take this into account. Padding is needed when packing with context parallelism, as the sequence lengths need to be divisible by 2 * CP.
- `nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel.get_batch_on_this_context_parallel_rank`: this now splits some input tensors across CP ranks (`tokens`, `labels`, `loss_mask`, etc.) but leaves others untouched: the `cu_seqlens` do not need splitting across ranks.
- `utils.sequence_packing_utils`: `fill_packing_strategy` is around 10x faster now (on such large datasets, moving from roughly 1 minute to 6 seconds) while being equivalent in result. `first_fit_shuffle_with_heap` does not have quadratic complexity: in local tests it runs in around 30 seconds, as opposed to 14 hours using `first_fit`. Its behaviour is very similar to `first_fit_shuffle`, and the packing efficiency is equivalent (around 96% in local tests). A minimal sketch of these two ideas follows this list.
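For intuition, here is a minimal, hypothetical sketch of the two ideas above: treating each raw sequence length as if it were already padded to a divisibility factor (2 * CP under context parallelism) when building the histogram, and packing with a heap keyed on remaining bin capacity to avoid first-fit's quadratic scan over open bins. All names (`round_up_to_multiple`, `pack_with_heap`, etc.) are illustrative, not this PR's actual code, and the heap variant shown is worst-fit-style rather than a literal first-fit.

```python
import heapq
from collections import Counter

def round_up_to_multiple(length: int, multiple: int) -> int:
    # Treat a raw length as already padded to `multiple` (e.g. 2 * CP),
    # so padding applied later cannot push a pack past its budget.
    return ((length + multiple - 1) // multiple) * multiple

def create_hist_sketch(seq_lengths, divisibility_factor: int = 1) -> Counter:
    # Histogram over padded lengths instead of raw lengths.
    return Counter(round_up_to_multiple(l, divisibility_factor) for l in seq_lengths)

def pack_with_heap(seq_lengths, pack_size: int, divisibility_factor: int = 1):
    # Max-heap of open bins keyed on remaining capacity: each placement is
    # O(log n), versus the linear scan over all open bins in plain first-fit.
    heap = []   # entries are (-remaining_capacity, bin_index)
    bins = []   # bins[i] holds the raw lengths packed into bin i
    for raw in seq_lengths:
        padded = round_up_to_multiple(raw, divisibility_factor)
        if heap and -heap[0][0] >= padded:
            # The emptiest open bin fits this sequence; place it there.
            neg_remaining, idx = heapq.heappop(heap)
            bins[idx].append(raw)
            heapq.heappush(heap, (neg_remaining + padded, idx))
        else:
            # No open bin fits (assumes padded <= pack_size); open a new one.
            bins.append([raw])
            heapq.heappush(heap, (-(pack_size - padded), len(bins) - 1))
    return bins
```

For example, `pack_with_heap([1500, 900, 2048, 700], pack_size=4096, divisibility_factor=16)` returns `[[1500, 900], [2048, 700]]`, with each bin's padded total staying within the 4096-token budget.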
Usage
Here you may need to temporarily override `FineTuningDataModule.setup` (where it defines `self.max_train_samples`) and `FineTuningDataModule._create_dataloader` (where it defines `self.init_global_step`).
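A hypothetical sketch of such a temporary override via subclassing; the import path follows NeMo 2.0's layout, and the replacement value is a placeholder, so treat everything here as an assumption rather than part of this PR:

```python
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule

class PackedChatDataModule(FineTuningDataModule):
    """Illustrative only: overrides the attributes named in the PR text."""

    def setup(self, stage):
        super().setup(stage)
        # Hypothetical: replace the value the parent `setup` computed for
        # `self.max_train_samples` with one suited to a very large packed
        # dataset. `_create_dataloader` (which sets `self.init_global_step`)
        # could be overridden analogously.
        self.max_train_samples = 10_000_000
```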
)GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
NLP collection reviewers: @MaximumEntropy, @ericharper, @ekmb, @yzhang123, @VahidooX, @vladgets, @okuchaiev
Additional Information
- Saving to a `.bin`/`.idx` format dataset (as opposed to currently holding all the data in memory and then saving a single `.npy`) solves the problem. Once again, I have a local implementation that I can add to the PR if needed.
- `nan` loss: see NVIDIA/Megatron-LM#1764. Without it, training with packing yields NaNs. The fix proposed there allows training correctly with packing and context parallelism.

All in all, with these modifications (and a parallelised tokenisation script) I can run packing on 10M samples, across CP ranks, on chat data.
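For illustration of the `.bin`/`.idx` idea, a minimal sketch of streaming packed sequences to disk instead of materialising one big `.npy`; this is not Megatron's indexed dataset implementation, and `write_bin_idx` plus the file layout are assumptions:

```python
import numpy as np

def write_bin_idx(packed_sequences, prefix: str) -> None:
    # Stream each packed sequence to a flat binary file and record
    # cumulative token offsets, so nothing is ever held in memory at once.
    offsets = [0]
    with open(f"{prefix}.bin", "wb") as f:
        for seq in packed_sequences:
            arr = np.asarray(seq, dtype=np.int32)
            f.write(arr.tobytes())
            offsets.append(offsets[-1] + arr.size)
    np.save(f"{prefix}.idx", np.asarray(offsets, dtype=np.int64))

def read_sequence(prefix: str, i: int) -> np.ndarray:
    # Memory-map the .bin and slice out sequence i via the offset index.
    offsets = np.load(f"{prefix}.idx.npy")
    data = np.memmap(f"{prefix}.bin", dtype=np.int32, mode="r")
    return np.asarray(data[offsets[i] : offsets[i + 1]])
```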