test: get highest checkpoint instead of hard coded path #383
base: main
Conversation
Signed-off-by: Anh Uong <[email protected]>
Thanks for making a pull request! 😃
Signed-off-by: Anh Uong <[email protected]>
Thank you @anhuong for this PR to fix the unit tests. Definitely good to have.
@fabianlim I was curious: does the number of checkpoints saved get affected when gradient_accumulation_steps changes? For example (testing with logging at each step): the sample dataset size is 10 and per_device_train_batch_size is 4, hence micro-batches per epoch is ~3.
@Abhishek-TAMU @anhuong I cannot reproduce the problem. I ran the exact same command provided above and I get 5 checkpoints. In accordance with my reading of the [code], save_strategy="epoch" should save a checkpoint on each epoch.
Also, I cannot reproduce your loss; for small models I need to use a much higher learning rate, like 1e-3, to get down to ~2 over 5 epochs.
@fabianlim the error only occurs on transformers v4.46, so agreed, I'm not sure how our configuration would cause the number of saved checkpoints to change when transformers upgrades... That makes sense that logging wouldn't affect the change; I thought it would be easier for us to read, and it makes more sense to set logging_strategy and save_strategy to the same value. I suspected it could have been something with gradient_accumulation that caused the issue, but I see, Fabian, you're saying that save_strategy="epoch" should still save on each epoch... hmmm.
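For reference, a minimal sketch of the kind of configuration being discussed here, with logging and saving both aligned on epoch boundaries. The argument values are illustrative, not the repo's actual test fixtures:

```python
from transformers import TrainingArguments

# Illustrative values only; the real test fixtures in fms-hf-tuning may differ.
args = TrainingArguments(
    output_dir="/tmp/test-output",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,  # GA=1 so optimizer steps line up with epochs
    logging_strategy="epoch",       # log the loss once per full epoch
    save_strategy="epoch",          # save a checkpoint once per full epoch
    learning_rate=1e-5,
)
```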
Our unit tests passed with transformers v4.45 but only started failing with transformers v4.46, because it looks like something changed with how checkpoints are being saved, as described. You can see this recent run of unit tests from 8 hours ago: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/11619798675/job/32360230590. I can recreate this unit test failure locally. This solution is better in that it does not hard-code checkpoint-5, but it would still be good to check that 5 checkpoints exist. That is also why only certain tests fail and not all of them: the FT tests don't fail because they only check that a checkpoint exists, rather than looking for a specific one.
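To illustrate the two assertion styles being compared here (the path and the expected count are hypothetical placeholders, not the actual test code):

```python
import os

output_dir = "/tmp/test-output"  # hypothetical path; wherever the test wrote its artifacts

checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint-")]

# FT-style check: only requires that *some* checkpoint exists.
assert len(checkpoints) > 0

# Stricter check suggested here: one checkpoint per training epoch,
# instead of hard-coding a specific name like "checkpoint-5".
assert len(checkpoints) == 5  # num_train_epochs used by the test
```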
I verified when running the unit tests that it is all in the gradient_accumulation setting. When this is set to 1, the expected number of checkpoints is created. When GA > 1, the number of checkpoints will be less than the number of epochs, even though save_strategy="epoch" is set.
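A rough sketch of the arithmetic behind this observation, assuming a single device and that a partial gradient-accumulation window is not flushed at each epoch boundary; the real Trainer internals may differ:

```python
import math

num_samples = 10
per_device_train_batch_size = 4
num_train_epochs = 5

micro_batches_per_epoch = math.ceil(num_samples / per_device_train_batch_size)  # 3

for ga in (1, 2, 4):
    steps_per_epoch = micro_batches_per_epoch / ga
    total_steps = (micro_batches_per_epoch * num_train_epochs) // ga
    print(f"GA={ga}: ~{steps_per_epoch:.2f} optimizer steps/epoch, ~{total_steps} total")

# GA=1 -> 3.00 steps/epoch, 15 total: one checkpoint per epoch (steps 3, 6, 9, 12, 15).
# GA=4 -> 0.75 steps/epoch, ~3 total: fewer checkpoints than epochs, which would match
# the checkpoint-0/1/2 directories observed in the failing runs.
```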
Signed-off-by: Anh Uong <[email protected]>
Force-pushed from 5f3ff51 to 1f109fb.
@anhuong I cannot reproduce this. NOTE: when I run tox I see an upper bound on the transformers version.
Update: my test passed.
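Since the comparison hinges on which library versions the tox environment actually resolves, one quick way to confirm them before reading the logs (run this with the environment's interpreter, e.g. .tox/py/bin/python; that path is just tox's default and may differ):

```python
# Print the resolved versions inside the test environment before comparing runs.
import peft
import transformers

print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```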
Sharing my test log for: tox -e py -- tests/test_sft_trainer.py::test_resume_training_from_checkpoint
Sharing the pip freeze output.
Running the unit tests locally:
======================================== short test summary info ========================================
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_true - assert 9.666666666666666 == (3.6666666666666665 + 5)
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_false - assert 0 == 1
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint_with_flag_checkpoint_path_lora - assert 29959084032.0 == 22686858240.0
FAILED tests/test_sft_trainer.py::test_run_causallm_pt_and_inference - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[default] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[custom_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[all_linear_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.jsonl] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.json] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl] - AssertionError: assert 4 == 6
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.json] - AssertionError: assert 4 == 6
================== 12 failed, 37 passed, 2 skipped, 336 warnings in 121.46s (0:02:01) ===================
You can see the checkpoints with print statements:
{'loss': 17.8934, 'grad_norm': 22.092439651489258, 'learning_rate': 1e-05, 'epoch': 1.67}
{'loss': 17.9788, 'grad_norm': 22.23282241821289, 'learning_rate': 8.535533905932739e-06, 'epoch': 3.67}
{'train_runtime': 1.7212, 'train_samples_per_second': 29.049, 'train_steps_per_second': 2.905, 'train_tokens_per_second': 1661.61, 'train_loss': 22.376200675964355, 'epoch': 3.67}
['checkpoint-1', 'training_logs.jsonl', 'checkpoint-0', 'checkpoint-2']
The pip freeze for this run:
Compared to a run with:
======================================== short test summary info ========================================
FAILED tests/test_sft_trainer.py::test_resume_training_from_checkpoint - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_pt_and_inference - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[default] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[custom_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_lora_and_inference[all_linear_target_modules] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.jsonl] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_and_inference[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_small.json] - AssertionError: assert 3 == 5
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl] - AssertionError: assert 4 == 6
FAILED tests/test_sft_trainer.py::test_run_causallm_ft_pretokenized[/Users/anhuong/github.com/anhuong/fms-hf-tuning/tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.json] - AssertionError: assert 4 == 6
=================== 9 failed, 40 passed, 2 skipped, 282 warnings in 118.76s (0:01:58) ===================
Print checkpoints:
{'loss': 6.8061, 'grad_norm': 8.93274211883545, 'learning_rate': 1e-05, 'epoch': 1.0}
{'loss': 3.3769, 'grad_norm': 5.7982940673828125, 'learning_rate': 5e-06, 'epoch': 2.0}
{'loss': 3.3822, 'grad_norm': 4.714280605316162, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 1.8403, 'train_samples_per_second': 27.169, 'train_steps_per_second': 2.717, 'train_tokens_per_second': 1554.079, 'train_loss': 4.064878463745117, 'epoch': 3.0}
['checkpoint-1', 'training_logs.jsonl', 'checkpoint-5', 'checkpoint-3']
So previously the test only succeeded because checkpoint-5 exists, whereas now it is checkpoint-3, but the number of checkpoints remains the same. Before, when I ran in the cluster, I was using a larger dataset with sample size 50 and did not see this behavior occur; 5 checkpoints were always saved.
My cluster runs:
$ python -m tuning.sft_trainer --model_name_or_path Maykeye/TinyLLama-v0 --training_data_path /app/twitter_complaints_small.json --output_dir /tmp/test-transformers-446-lora --num_train_epochs 5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --learning_rate 1e-5 --response_template "\n### Label:" --dataset_text_field "output" --torch_dtype "float32" --logging_steps 1 --save_strategy "epoch" --use_flash_attn false --per_device_eval_batch_size 4 --weight_decay 0 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --include_tokens_per_second true --packing false --peft_method lora
# GA=1, dataset_sample_size=10
/tmp/test-transformers-446-lora-ga-1:
added_tokens_info.json checkpoint-15 checkpoint-6 training_logs.jsonl
checkpoint-12 checkpoint-3 checkpoint-9
# GA=2, dataset_sample_size=10
/tmp/test-transformers-446-lora-ga-2:
added_tokens_info.json checkpoint-1 checkpoint-3 checkpoint-4 checkpoint-5 training_logs.jsonl
# GA=4, dataset_sample_size=10
/tmp/test-transformers-446-lora-no-logging:
added_tokens_info.json checkpoint-0 checkpoint-1 checkpoint-2 training_logs.jsonl
# GA=4, dataset_sample_size=50
/tmp/test-transformers-446-lora-no-logging-50-samples:
added_tokens_info.json checkpoint-15 checkpoint-6 training_logs.jsonl
checkpoint-13 checkpoint-3 checkpoint-9
In addition, we see training end early when GA=4 and dataset sample size=10. Logs:
Currently training with a batch size of: 4
***** Running training *****
Num examples = 10
Num Epochs = 5
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 4
Total optimization steps = 5
Number of trainable parameters = 16,384
0%| | 0/5 [00:00<?, ?it/s]Saving model checkpoint to /tmp/test-transformers-446-lora-no-logging/checkpoint-0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/8c7ff07ec91bbe08ba42634a8611deb028a77896/config.json
Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 4,
"hidden_act": "silu",
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 256,
"max_position_embeddings": 2048,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 8,
"num_key_value_heads": 16,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"vocab_size": 32000
}
/home/tuning/.local/lib/python3.11/site-packages/peft/utils/save_and_load.py:257: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
tokenizer config file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-0/tokenizer_config.json
Special tokens file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-0/special_tokens_map.json
{'loss': 20.4898, 'grad_norm': 20.289756774902344, 'learning_rate': 1e-05, 'epoch': 1.67}
20%|██████████████ | 1/5 [00:10<00:41, 10.41s/it]Saving model checkpoint to /tmp/test-transformers-446-lora-no-logging/checkpoint-1
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/8c7ff07ec91bbe08ba42634a8611deb028a77896/config.json
Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 4,
"hidden_act": "silu",
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 256,
"max_position_embeddings": 2048,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 8,
"num_key_value_heads": 16,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"vocab_size": 32000
}
/home/tuning/.local/lib/python3.11/site-packages/peft/utils/save_and_load.py:257: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
tokenizer config file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-1/tokenizer_config.json
Special tokens file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-1/special_tokens_map.json
Saving model checkpoint to /tmp/test-transformers-446-lora-no-logging/checkpoint-1
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/8c7ff07ec91bbe08ba42634a8611deb028a77896/config.json
Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 4,
"hidden_act": "silu",
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 256,
"max_position_embeddings": 2048,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 8,
"num_key_value_heads": 16,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"vocab_size": 32000
}
/home/tuning/.local/lib/python3.11/site-packages/peft/utils/save_and_load.py:257: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
tokenizer config file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-1/tokenizer_config.json
Special tokens file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-1/special_tokens_map.json
{'loss': 20.9744, 'grad_norm': 19.00687599182129, 'learning_rate': 8.535533905932739e-06, 'epoch': 3.67}
40%|████████████████████████████ | 2/5 [00:10<00:13, 4.50s/it]Saving model checkpoint to /tmp/test-transformers-446-lora-no-logging/checkpoint-2
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/8c7ff07ec91bbe08ba42634a8611deb028a77896/config.json
Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 4,
"hidden_act": "silu",
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 256,
"max_position_embeddings": 2048,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 8,
"num_key_value_heads": 16,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"vocab_size": 32000
}
/home/tuning/.local/lib/python3.11/site-packages/peft/utils/save_and_load.py:257: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
tokenizer config file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-2/tokenizer_config.json
Special tokens file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-2/special_tokens_map.json
Saving model checkpoint to /tmp/test-transformers-446-lora-no-logging/checkpoint-2
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:connectionpool.py:https://huggingface.co:443 "HEAD /Maykeye/TinyLLama-v0/resolve/main/config.json HTTP/1.1" 200 0
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Maykeye--TinyLLama-v0/snapshots/8c7ff07ec91bbe08ba42634a8611deb028a77896/config.json
Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 4,
"hidden_act": "silu",
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 256,
"max_position_embeddings": 2048,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 8,
"num_key_value_heads": 16,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"vocab_size": 32000
}
/home/tuning/.local/lib/python3.11/site-packages/peft/utils/save_and_load.py:257: UserWarning: Setting `save_embedding_layers` to `True` as the embedding layer has been resized during finetuning.
warnings.warn(
tokenizer config file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-2/tokenizer_config.json
Special tokens file saved in /tmp/test-transformers-446-lora-no-logging/checkpoint-2/special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 11.1269, 'train_samples_per_second': 4.494, 'train_steps_per_second': 0.449, 'train_tokens_per_second': 270.516, 'train_loss': 25.917766571044922, 'epoch': 3.67}
40%|████████████████████████████ | 2/5 [00:11<00:16, 5.56s/it]
Pip freeze of the 2.1.1-rc1 image running in the cluster:
@fabianlim With logging statements I found out…
Description of the change

Unit tests were failing on transformers v4.46 with errors such as:

assert 9.666666666666666 == (3.6666666666666665 + 5)

where the epochs being logged didn't match, and failures on checkpoint-5, where only 3 checkpoints were being saved, not matching the number of epochs.

Because of this I updated our TrainingArguments parameters to use gradient_accumulation_steps=1, to create the number of checkpoints expected for the small dataset we are using, and logging_strategy="epoch", so that the loss logs would print per full epoch. When setting gradient_accumulation_steps>1, the loss values would be logged at fractions of epochs, and I suspect that because the logging was set to partial epochs, the checkpoints being saved may have also been off. This is more consistent with the save_strategy="epoch" that we have set, and I removed logging_steps=1 since we are logging based on epochs.

The loss logging, together with a print statement I added for os.listdir(output_dir), showed that only 3 checkpoints were saved, including one for checkpoint-0, likely because of the partial epochs being logged. With these changes, the loss logging and os.listdir(output_dir) show the expected number of checkpoints.

I'm not sure how the number is determined for checkpoint-<number>, but it's better not to hardcode checkpoint-5, so instead I refactored our existing code to get the highest checkpoint and return it; a rough sketch of that approach is shown below.
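A rough sketch of the highest-checkpoint lookup described above; the function name and details are illustrative, not the exact helper refactored in the repo:

```python
import os
import re

def get_highest_checkpoint(output_dir: str) -> str:
    """Return the path of the checkpoint-<N> directory with the largest N."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = []
    for entry in os.listdir(output_dir):
        match = pattern.match(entry)
        if match and os.path.isdir(os.path.join(output_dir, entry)):
            checkpoints.append((int(match.group(1)), entry))
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint-* directories found in {output_dir}")
    _, highest = max(checkpoints)
    return os.path.join(output_dir, highest)
```

Tests can then assert on whatever this returns instead of a hard-coded checkpoint-5 path.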
Related issue number

How to verify the PR
I verified that unit tests pass with transformers v4.45 and v4.46.
Was the PR tested