(torchscale) yehuicheng@bdp-gpu04:~/torchscale/examples/fairseq$ torchrun --nproc_per_node=8 --master_port 29501 --nnodes=1 train.py \
    /home/data/dataset/yehuicheng/LongNet_example/DNA_example/longnet_example \
    --num-workers 0 --activation-fn gelu --share-decoder-input-output-embed \
    --validate-interval-updates 1000 --save-interval-updates 1000 --no-epoch-checkpoints \
    --memory-efficient-fp16 --fp16-init-scale 4 \
    --arch transformer --task language_modeling \
    --sample-break-mode none --tokens-per-sample 4096 \
    --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler polynomial_decay --warmup-updates 750 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --batch-size 4 --update-freq 1 --required-batch-size-multiple 1 \
    --total-num-update 50000 --max-update 50000 --seed 1 --ddp-backend=c10d \
    --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************
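A note on the OMP_NUM_THREADS warning above: torchrun pins each worker process to a single OpenMP thread by default. This is unrelated to the crash below, but if CPU-side work (e.g. data loading) ever becomes a bottleneck you can set the variable explicitly before launching; the value here is purely illustrative:

# illustrative only; pick a value suited to your CPU count and --nproc_per_node
export OMP_NUM_THREADS=4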
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu]
[--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view]
[--plasma-path PLASMA_PATH] [--log-nvidia-smi]
[--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}]
[--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}]
[--optimizer {sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}]
[--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}]
[--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID]
[--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
[--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
[--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
[--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished]
[--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric]
[--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH]
[--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D]
[--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N]
[--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR]
[--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos]
[--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings]
[--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding]
[--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention]
[--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
[--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D]
[--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] [--moe-expert-count D]
[--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping]
[--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert]
[--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD]
[--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}]
[--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target]
[--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS]
[--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length]
[--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam]
[--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL]
[--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
[--unk UNK]
data
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
[... the same usage listing and "train.py: error: unrecognized arguments" message repeat verbatim for the remaining worker ranks; duplicates trimmed ...]
W1108 21:43:16.641655 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273819 closing signal SIGTERM
W1108 21:43:16.642491 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273820 closing signal SIGTERM
W1108 21:43:16.642741 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273821 closing signal SIGTERM
W1108 21:43:16.643247 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273822 closing signal SIGTERM
W1108 21:43:16.643435 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273823 closing signal SIGTERM
E1108 21:43:16.708592 140431967650432 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 2273818) of binary: /home/yehuicheng/miniconda3/envs/torchscale/bin/python3.8
Traceback (most recent call last):
File "/home/yehuicheng/miniconda3/envs/torchscale/bin/torchrun", line 8, in
sys.exit(main())
File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 6 (local_rank: 6)
exitcode : 2 (pid: 2273828)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 7 (local_rank: 7)
exitcode : 2 (pid: 2273830)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 2273818)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
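The actual failure is the argparse error repeated on every rank: train.py rejects --flash-attention, --segment-length, and --dilated-ratio. None of these flags appear in the usage listing, which is what you get when fairseq's builtin transformer architecture is selected; the LongNet/dilated-attention options are registered by torchscale's model code, not by --arch transformer. A minimal sketch of the likely fix, assuming the torchscale language-model architecture is named lm_base as in the examples/fairseq README (verify the exact name in your checkout):

# sketch, not a verified command: select a torchscale arch that registers
# the LongNet flags; "lm_base" is an assumption from the examples/fairseq README
torchrun --nproc_per_node=8 --master_port 29501 --nnodes=1 train.py \
    /home/data/dataset/yehuicheng/LongNet_example/DNA_example/longnet_example \
    --task language_modeling --arch lm_base \
    --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
    # plus the remaining optimizer/checkpoint flags from the failing command

If the architecture is right but the flags are still rejected, double-check that the train.py being launched is the one in torchscale/examples/fairseq (which extends fairseq's CLI with these options) rather than an upstream fairseq entry point.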