System Info

Python version: 3.11.10
Reproduction

Preprocessing file:
### model
model_name_or_path: Qwen2.5-3B

### method
stage: pt
do_train: true
finetuning_type: full

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 30
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
overwrite_output_dir: true
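For reference, assuming the YAML above is saved as, say, qwen2_preprocess.yaml (hypothetical filename), the tokenization pass can be launched with the standard CLI:

llamafactory-cli train qwen2_preprocess.yaml

With tokenized_path set, the processed dataset should be written to tokenized_data_2048 and reused by later runs instead of being re-tokenized.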
Training code
### model
model_name_or_path: Qwen2.5-3B
flash_attn: auto

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z0_config.json
enable_liger_kernel: true

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 16
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
logging_steps: 1000
save_steps: 50000
save_total_limit: 5
plot_loss: true
overwrite_output_dir: false
report_to: wandb
run_name: official_qwen_pre_training

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
disable_gradient_checkpointing: true

### eval
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 50000
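For completeness, the training run with the DeepSpeed config above is normally launched through the same CLI; for multi-GPU runs the LLaMA-Factory docs use torchrun via an environment variable (config filename here is hypothetical):

FORCE_TORCHRUN=1 llamafactory-cli train qwen2_pretrain.yaml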
Expected behavior

I would like to know whether this setup guarantees that the data is shuffled during pre-training and, if possible, to locate where this happens in the code, as I am unable to find it.
Thanks
It seems this is handled by the "streaming" option. Check LLaMA-Factory/src/llamafactory/data/loader.py, lines 249-250:
if data_args.streaming:
    dataset = dataset.shuffle(buffer_size=data_args.buffer_size, seed=training_args.seed)
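Note that this branch only runs when streaming is enabled, and the shuffling it gives is approximate: examples are drawn from a fixed-size buffer rather than globally permuted. A minimal sketch of that behaviour using the Hugging Face datasets library (standalone illustration, not LLaMA-Factory code; the toy dataset, seed, and buffer size are made up):

from datasets import Dataset

# Toy dataset standing in for the tokenized pre-training corpus.
ds = Dataset.from_dict({"text": [f"sample {i}" for i in range(10)]})
streamed = ds.to_iterable_dataset()

# In streaming mode, shuffle() fills a buffer of `buffer_size` examples and
# samples from it, so ordering is randomized only within that window.
shuffled = streamed.shuffle(seed=42, buffer_size=4)
print([ex["text"] for ex in shuffled])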
The data will be shuffled in pre-training
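For the non-streaming path used in the configs above (streaming is not set), the shuffling does not appear in loader.py: by default the Hugging Face Trainer builds its training DataLoader with a RandomSampler for map-style datasets, so the tokenized examples are reshuffled on every pass over the data. A minimal plain-PyTorch sketch of that default behaviour (illustrative only, not LLaMA-Factory code):

import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Eight dummy examples standing in for the tokenized corpus.
dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=3, sampler=RandomSampler(dataset))

for epoch in range(2):
    # Each pass over the DataLoader draws a fresh random permutation of indices.
    print(f"epoch {epoch}:", [batch[0].tolist() for batch in loader])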