Checkpoint feature via steps instead of epoch #724

Open
mylesgoose opened this issue Oct 11, 2024 · 5 comments

Comments

@mylesgoose

🚀 The feature, motivation and pitch

At the moment the script only saves checkpoints per epoch. For large datasets this is quite limiting.

Alternatives

I created an alternative here.

Additional context

The script will now save at the specified step interval during training and mark the files or folders according to the step and epoch. It also fixes some errors found in the original logic.
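Conceptually the change is small: instead of saving only at epoch boundaries, the training loop checks a global step counter. A simplified, single-process sketch (illustrative names only; the actual fork uses the FSDP/distributed checkpoint path):

import os
import torch
import torch.nn.functional as F

def train(model, optimizer, dataloader, num_epochs, checkpoint_interval, out_dir):
    """Save a checkpoint every `checkpoint_interval` optimizer steps,
    tagging it with both the epoch and the global step."""
    os.makedirs(out_dir, exist_ok=True)
    global_step = 0
    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = F.mse_loss(model(inputs), targets)  # toy objective for illustration
            loss.backward()
            optimizer.step()
            global_step += 1
            if checkpoint_interval > 0 and global_step % checkpoint_interval == 0:
                # folder/file name carries both epoch and step, e.g. "epoch_0-step_500"
                path = os.path.join(out_dir, f"epoch_{epoch}-step_{global_step}.pt")
                torch.save({"epoch": epoch,
                            "step": global_step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, path)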

@mreso
Contributor

mreso commented Oct 11, 2024

Hi @mylesgoose, I think that could be a great idea. Can you share a bit about how the interface would look after this integration?

@mylesgoose
Author

mylesgoose commented Oct 12, 2024

Well, I actually implemented it above. :-) @mreso
You could test it for me if you like.
I have only tested with fully sharded (FSDP). The added features are checkpoint_interval and max_checkpoints_to_keep.
Also, in the interface there is a picture of a llama, in case you are GPU poor lol.
https://github.com/mylesgoose/llama-recipes/tree/llama-3.2-vision

I had to change quite a lot of things, so this would need to be tested by others before a pull request.
I have not worked on the resume-from-checkpoint feature yet, i.e. tracking how far we got in the training steps and resuming from that point. But this is already quite handy: if training crashes, you at least have a checkpoint saved mid-epoch, which you can technically resume from by modifying the input training data.
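The max_checkpoints_to_keep rotation is essentially just pruning the oldest tagged checkpoints after each save. Roughly (an illustrative sketch, not the exact code in the fork):

import os
import shutil

def prune_old_checkpoints(out_dir, max_checkpoints_to_keep):
    """Keep only the newest `max_checkpoints_to_keep` checkpoints in out_dir."""
    entries = [os.path.join(out_dir, name) for name in os.listdir(out_dir)
               if name.startswith("epoch_")]
    entries.sort(key=os.path.getmtime)  # oldest first
    while len(entries) > max_checkpoints_to_keep:
        oldest = entries.pop(0)
        # distributed checkpoints are folders; single-file checkpoints are plain files
        if os.path.isdir(oldest):
            shutil.rmtree(oldest)
        else:
            os.remove(oldest)

The command I used for testing: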

torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning.py \
    --enable_fsdp \
    --lr 1e-5 \
    --num_epochs 1 \
    --batch_size_training 2 \
    --model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --dist_checkpoint_root_folder ./finetuned_model \
    --dist_checkpoint_folder ./finetuned_model \
    --use_fast_kernels True \
    --dataset "custom_dataset" \
    --custom_dataset.test_split "test" \
    --custom_dataset.file "/home/myles/llama-recipes/recipes/quickstart/finetuning/datasets/json_dataset.py" \
    --run_validation True \
    --batching_strategy padding \
    --use_wandb True \
    --gradient_accumulation_steps 1 \
    --checkpoint_interval 5 \
    --max_checkpoints_to_keep 2 \
    --context_length 4096 \
    --gradient_clipping False \
    --gradient_clipping_threshold 1.0 \
    --max_train_step 0 \
    --max_eval_step 0 \
    --num_workers_dataloader 16 \
    --weight_decay 0.0 \
    --gamma 0.85 \
    --seed 42 \
    --use_fp16 False \
    --mixed_precision True \
    --val_batch_size 1 \
    --peft_method "lora" \
    --use_peft False \
    --from_peft_checkpoint "" \
    --output_dir "./finetuned_model" \
    --freeze_layers False \
    --num_freeze_layers 1 \
    --quantization None \
    --one_gpu False \
    --save_model True \
    --save_optimizer True \
    --save_metrics True \
    --flop_counter False \
    --flop_counter_start 3 \
    --use_profiler False \
    --profiler_dir "./finetuned_model/profiler/results"

import torch

llama_art_printed = False  # module-level flag so the art is only printed once

def display_llama_art():
    global llama_art_printed
    # Print only once, and only on rank 0 when torch.distributed is initialized
    if not llama_art_printed and (not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0):
        llama_art = r"""
                               .                                                          
                              +=-                                                         
                             :*#+:             :==                                        
                             :#%#+.          :+*#+                                        
                              +%#+=---------+*#%#:                                        
                               =+==+++====-=+*%#:                                         
                               :-==+=+*++==---:                                           
                               -=%+=--::+%==-.                                            
                               .+@:---::@+%=:                                             
                               .:::*%#*:.: .                                              
                               .:-:.-+*#=..;                                              
                               .:-:_+*+_....                                              
                              :::------:.:.                                               
                             .--======--:                                                 
                             .--=++++=--.                 .:                              
                            .:-==++++=-:.   ....:::.......:=-::.                          
                            .:-==+++=-:.:--=========------::--===:                        
                           .::-==++=-:::-==+==++=====--===---::=++-.                      
                           .::-===+=--:-===++==++====---=-:---::=+-.                      
                           .:---+++=---===++++-======-=--:+----:.=+:                      
                           .::--=++=--====+++==--==-=---::+===-::==:                      
                           ..:-============+=------=---:.+===+-:.=::                      
                           ...::-============-:-:=----:.=++=+=-...:                       
                            ..::--====+======-:--=---::-+++++-:.                          
                             ...:::-:---=====::------:-+*++++=-                           
                              ....::::::---=-:::------++*++==+:                           
                                .:-==--=-----:::::-===*#*++++=:                           
                                .:-=++----==-:.:--++=-+#####+=-                           
                                .-=-++-.:==+=:  .:=++=-*%#**+-:                           
                                .:====::-=++=.   :=++==---==--::                          
                                .:-+=-:-==++-    :=++==+-:--:-:-.                         
                                :-===:=====-:     :-+==*=-==-::...                        
                               .:=+=:====---:     -+**++=-=++=-:...                       
 .:                             :---:-====--:     -=+*++*--*#+=::.                        
 .:                             :=--:+##*+-:.     ==++*+: .=*==-:.                        
 .:                              :--::+##*-      .==***=   -+=-::                         
 .:                               ::-.:=+-.      .=++#=    -++=:.                         
 .:                               :--. -=:      .=+++-     :+*=:                          
 .:                               --:. :=:.     =+#*:      =**-:                          
 .:                               -=:. =+=:   :+**=        +**+.                          
 .:                            ..:==:  =++-.+**#*:        :#*=.                           
 .:                            =+++-   +#*=. .....       -#+-:                            
 .:                             ....  -#*+-.                                              
    """
        print(llama_art)  # print the art once
        llama_art_printed = True  # set the flag so it is not printed again

@mreso
Contributor

mreso commented Oct 12, 2024

Great! Could you prepare the checkpointing pieces into a PR? Happy to review it.

@mylesgoose
Author

@mreso I think that fork is fairly well prepared. However, I think it should be pulled into a separate development branch in your repo, as it still needs some work, as discussed, around resuming from the saved checkpoint. I have tested running the saved checkpoint and converting it to HF, and that worked. Also, your convert-to-HF script did not work with the Llama vision models, so I made a new one (included); I have already opened a PR for that. I also don't use conda environments; I just compiled everything from source and used the latest versions of things like CUDA and torch, so I will have to set up a conda env to test the PR and make it more reproducible with the packages listed in your requirements.txt file. I think the plan would then be to download your updated repo at main, modify the checkpoint file as per the one above, push to a new PR branch in my account, and then open a pull request from that branch. You have had 5 pushes since I modified those files, so I will then be able to determine whether it is compatible with your changes.

@mreso
Contributor

mreso commented Oct 12, 2024

Yes, separating the checkpointing changes from the dataset examples and testing within the right env will be a good idea. You can rebase your changes onto the current main if necessary. Let me know if you need help with that.
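For reference, rebasing the fork branch onto the current main would look roughly like this (assuming the upstream remote points at the main llama-recipes repo and the branch is llama-3.2-vision):

git remote add upstream <main llama-recipes repo URL>
git fetch upstream
git checkout llama-3.2-vision
git rebase upstream/main
# resolve any conflicts, then update the fork
git push --force-with-lease origin llama-3.2-vision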
