Checkpoint feature via steps instead of epoch #724

Open
mylesgoose opened this issue Oct 11, 2024 · 5 comments

Comments

@mylesgoose

🚀 The feature, motivation and pitch

At the moment the script only saves checkpoints per epoch. For large datasets this is quite limiting.

Alternatives

I created an alternative here.

Additional context

The script will now save at the specified step interval during training and mark the files or folders according to the step and epoch. It also fixes some errors found in the original logic.
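Conceptually the change is small: instead of saving only at epoch boundaries, the training loop checks a global step counter. A simplified, single-process sketch (illustrative names only; the actual fork uses the FSDP/distributed checkpoint path):

import os
import torch
import torch.nn.functional as F

def train(model, optimizer, dataloader, num_epochs, checkpoint_interval, out_dir):
    """Save a checkpoint every `checkpoint_interval` optimizer steps,
    tagging it with both the epoch and the global step."""
    os.makedirs(out_dir, exist_ok=True)
    global_step = 0
    for epoch in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = F.mse_loss(model(inputs), targets)  # toy objective for illustration
            loss.backward()
            optimizer.step()
            global_step += 1
            if checkpoint_interval > 0 and global_step % checkpoint_interval == 0:
                # folder/file name carries both epoch and step, e.g. "epoch_0-step_500"
                path = os.path.join(out_dir, f"epoch_{epoch}-step_{global_step}.pt")
                torch.save({"epoch": epoch,
                            "step": global_step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, path)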

@mreso
Contributor

mreso commented Oct 11, 2024

Hi @mylesgoose, I think that could be a great idea. Can you share a bit about how the interface would look after this integration?

@mylesgoose
Author

mylesgoose commented Oct 12, 2024

Well, I actually implemented it above. :-) @mreso
You could test it for me if you like.
I have only tested with fully sharded (FSDP). The added features are checkpoint_interval and max_checkpoints_to_keep.
Also, in the interface there is a picture of a llama, in case you are GPU poor lol.
https://github.com/mylesgoose/llama-recipes/tree/llama-3.2-vision

I had to change quite a lot of things, so this would need to be tested by others before a pull request.
I have not worked on the resume-from-checkpoint feature yet, i.e. tracking how far we got in the training steps and resuming from that point. But this is already quite handy: if training crashes, you at least have a checkpoint saved mid-epoch, which you can technically resume from by modifying the input training data.
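The max_checkpoints_to_keep rotation is essentially just pruning the oldest tagged checkpoints after each save. Roughly (an illustrative sketch, not the exact code in the fork):

import os
import shutil

def prune_old_checkpoints(out_dir, max_checkpoints_to_keep):
    """Keep only the newest `max_checkpoints_to_keep` checkpoints in out_dir."""
    entries = [os.path.join(out_dir, name) for name in os.listdir(out_dir)
               if name.startswith("epoch_")]
    entries.sort(key=os.path.getmtime)  # oldest first
    while len(entries) > max_checkpoints_to_keep:
        oldest = entries.pop(0)
        # distributed checkpoints are folders; single-file checkpoints are plain files
        if os.path.isdir(oldest):
            shutil.rmtree(oldest)
        else:
            os.remove(oldest)

The command I used for testing: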

torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning.py \
    --enable_fsdp \
    --lr 1e-5 \
    --num_epochs 1 \
    --batch_size_training 2 \
    --model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --dist_checkpoint_root_folder ./finetuned_model \
    --dist_checkpoint_folder ./finetuned_model \
    --use_fast_kernels True \
    --dataset "custom_dataset" \
    --custom_dataset.test_split "test" \
    --custom_dataset.file "/home/myles/llama-recipes/recipes/quickstart/finetuning/datasets/json_dataset.py" \
    --run_validation True \
    --batching_strategy padding \
    --use_wandb True \
    --gradient_accumulation_steps 1 \
    --checkpoint_interval 5 \
    --max_checkpoints_to_keep 2 \
    --context_length 4096 \
    --gradient_clipping False \
    --gradient_clipping_threshold 1.0 \
    --max_train_step 0 \
    --max_eval_step 0 \
    --num_workers_dataloader 16 \
    --weight_decay 0.0 \
    --gamma 0.85 \
    --seed 42 \
    --use_fp16 False \
    --mixed_precision True \
    --val_batch_size 1 \
    --peft_method "lora" \
    --use_peft False \
    --from_peft_checkpoint "" \
    --output_dir "./finetuned_model" \
    --freeze_layers False \
    --num_freeze_layers 1 \
    --quantization None \
    --one_gpu False \
    --save_model True \
    --save_optimizer True \
    --save_metrics True \
    --flop_counter False \
    --flop_counter_start 3 \
    --use_profiler False \
    --profiler_dir "./finetuned_model/profiler/results"

import torch

llama_art_printed = False  # module-level flag so the art is only printed once

def display_llama_art():
    global llama_art_printed
    # Print only once, and only on rank 0 when torch.distributed is initialized
    if not llama_art_printed and (not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0):
        llama_art = r"""
                               .                                                          
                              +=-                                                         
                             :*#+:             :==                                        
                             :#%#+.          :+*#+                                        
                              +%#+=---------+*#%#:                                        
                               =+==+++====-=+*%#:                                         
                               :-==+=+*++==---:                                           
                               -=%+=--::+%==-.                                            
                               .+@:---::@+%=:                                             
                               .:::*%#*:.: .                                              
                               .:-:.-+*#=..;                                              
                               .:-:_+*+_....                                              
                              :::------:.:.                                               
                             .--======--:                                                 
                             .--=++++=--.                 .:                              
                            .:-==++++=-:.   ....:::.......:=-::.                          
                            .:-==+++=-:.:--=========------::--===:                        
                           .::-==++=-:::-==+==++=====--===---::=++-.                      
                           .::-===+=--:-===++==++====---=-:---::=+-.                      
                           .:---+++=---===++++-======-=--:+----:.=+:                      
                           .::--=++=--====+++==--==-=---::+===-::==:                      
                           ..:-============+=------=---:.+===+-:.=::                      
                           ...::-============-:-:=----:.=++=+=-...:                       
                            ..::--====+======-:--=---::-+++++-:.                          
                             ...:::-:---=====::------:-+*++++=-                           
                              ....::::::---=-:::------++*++==+:                           
                                .:-==--=-----:::::-===*#*++++=:                           
                                .:-=++----==-:.:--++=-+#####+=-                           
                                .-=-++-.:==+=:  .:=++=-*%#**+-:                           
                                .:====::-=++=.   :=++==---==--::                          
                                .:-+=-:-==++-    :=++==+-:--:-:-.                         
                                :-===:=====-:     :-+==*=-==-::...                        
                               .:=+=:====---:     -+**++=-=++=-:...                       
 .:                             :---:-====--:     -=+*++*--*#+=::.                        
 .:                             :=--:+##*+-:.     ==++*+: .=*==-:.                        
 .:                              :--::+##*-      .==***=   -+=-::                         
 .:                               ::-.:=+-.      .=++#=    -++=:.                         
 .:                               :--. -=:      .=+++-     :+*=:                          
 .:                               --:. :=:.     =+#*:      =**-:                          
 .:                               -=:. =+=:   :+**=        +**+.                          
 .:                            ..:==:  =++-.+**#*:        :#*=.                           
 .:                            =+++-   +#*=. .....       -#+-:                            
 .:                             ....  -#*+-.                                              
    """
        print(llama_art)  # print the art once
        llama_art_printed = True  # set the flag so it is not printed again

@mreso
Contributor

mreso commented Oct 12, 2024

Great! Could you prepare the checkpointing pieces into a PR? Happy to review it.

@mylesgoose
Author

@mreso I think that fork is fairly well prepared. However, I think it should be pulled into a separate development branch in your repo, as it still needs some work, as discussed, around resuming from the saved checkpoint. I have tested running the saved checkpoint and converting it to HF, and that worked. Also, your convert-to-HF script did not work with the Llama vision models, so I made a new one (included); I have already opened a PR for that. I also don't use conda environments; I just compiled everything from source and used the latest versions of things like CUDA and torch, so I will have to set up a conda env to test the PR and make it more reproducible with the packages listed in your requirements.txt file. I think the plan would then be to download your updated repo at main, modify the checkpoint file as per the one above, push to a new PR branch in my account, and then open a pull request from that branch. You have had 5 pushes since I modified those files, so I will then be able to determine whether it is compatible with your changes.

@mreso
Contributor

mreso commented Oct 12, 2024

Yes, separating the checkpointing changes from the dataset examples and testing within the right env will be a good idea. You can rebase your changes onto the current main if necessary. Let me know if you need help with that.
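For reference, rebasing the fork branch onto the current main would look roughly like this (assuming the upstream remote points at the main llama-recipes repo and the branch is llama-3.2-vision):

git remote add upstream <main llama-recipes repo URL>
git fetch upstream
git checkout llama-3.2-vision
git rebase upstream/main
# resolve any conflicts, then update the fork
git push --force-with-lease origin llama-3.2-vision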
