This tool calculates Model FLOPs Utilization (MFU) for common Large Language Models (LLMs). It analyzes training logs generated by DLLogger to provide insight into training efficiency.
- MFU Calculation: Computes MFU from the average step time, model FLOPs per sample, global batch size, and the accelerator's theoretical peak TFLOPS (see the sketch after this list).
- DLLogger Integration: Parses DLLogger log files to extract the relevant training data.
- Model Support: Includes predefined FLOPs per sample for popular LLMs such as GPT-3, Llama 2, and Mixtral.
- Accelerator Awareness: Supports several GPU/TPU types with default theoretical peak TFLOPS values.
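For reference, the MFU computation reduces to a few lines. Here is a minimal sketch; the function and variable names are illustrative, not the script's actual API:

```python
def compute_mfu(model_flops_per_sample, batch_size, avg_step_time_s,
                num_accelerators, peak_tflops):
    """Model FLOPs Utilization: achieved throughput over theoretical peak.

    model_flops_per_sample: forward + backward FLOPs for one training sample.
    peak_tflops: theoretical max TFLOPS (1 TFLOPS = 1e12 FLOPS) per accelerator.
    """
    # Total FLOPs executed per second across the whole job.
    achieved_flops_per_s = model_flops_per_sample * batch_size / avg_step_time_s
    # Achieved TFLOPS on a single accelerator.
    achieved_tflops = achieved_flops_per_s / num_accelerators / 1e12
    return achieved_tflops / peak_tflops
```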
python process_training_results.py --file <path_to_dllogger_file> \
--batch_size <batch_size> \
--num_accelerators <num_accelerators> \
[--model_type <model_type> | --model_flops <model_flops>] \
[--accelerator_type <accelerator_type> | --max_flops <max_flops>] \
[--start_step <start_step>] \
[--end_step <end_step>]
--file
: Path to the DLLogger log file.

--batch_size
: Global batch size used during training.

--num_accelerators
: Number of GPUs/TPUs used for training.

--model_type
: Type of LLM used. Choose from the predefined options (e.g., "gpt3-5b", "llama2-7b"). Currently supported models:
- gpt3-5b
- gpt3-175b
- llama2-7b
- llama2-70b
- llama3-70b
- mixtral-7b

--model_flops
: Manually specify the model FLOPs (forward + backward) per sample. Use this if your model is not listed under --model_type.

--accelerator_type
: Type of accelerator used. Choose from the predefined options (e.g., "h100", "a100", "v5e", "v5p"). Currently supported accelerators:
- h100
- a100
- v5e
- v5p

--max_flops
: Manually specify the maximum theoretical TFLOPS of one accelerator. Use this if your accelerator is not currently supported.

--start_step
: First step of the range used to compute the average training step time. Defaults to 10.

--end_step
: Last step of the range used to compute the average training step time. Defaults to 30 (see the averaging sketch after the note below).
Note: You must provide either --model_type or --model_flops, and either --accelerator_type or --max_flops.
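To illustrate what --start_step and --end_step do, here is a minimal sketch of averaging the step time from a DLLogger-style JSON-lines file. The `train_step_timing` key and the record layout are assumptions; the actual field names depend on what the training script logged:

```python
import json

def average_step_time(path, start_step=10, end_step=30,
                      time_key="train_step_timing"):
    """Average the per-step time over the window [start_step, end_step].

    The first steps are skipped so that warm-up overhead (compilation,
    data-pipeline spin-up) does not skew the average.
    """
    times = []
    with open(path) as f:
        for line in f:
            brace = line.find("{")  # DLLogger prefixes each JSON record
            if brace < 0:
                continue
            try:
                record = json.loads(line[brace:])
            except json.JSONDecodeError:
                continue
            step = record.get("step")
            if isinstance(step, (list, tuple)):  # steps may be logged as lists
                step = step[0] if step else None
            if not isinstance(step, int):
                continue
            data = record.get("data", {})
            if start_step <= step <= end_step and time_key in data:
                times.append(float(data[time_key]))
    return sum(times) / len(times)
```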
python3 process_training_results.py --file examples/dllogger.json \
--batch_size 2048 \
--num_accelerators 256 \
--model_type gpt3-175b \
--accelerator_type h100
This command analyzes the examples/dllogger.json file for a GPT3-175B model trained with a batch size of 2048 on 256 H100 accelerators.
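For intuition, plugging this example's predefined constants into the MFU formula sketched above (the 40 s step time is a made-up value; the script derives the real one from the log):

```python
model_flops = 2.2e15   # gpt3-175b FLOPs per sample (from the table below)
peak_tflops = 989      # h100 bf16 peak TFLOPS (from the table below)
batch_size, num_accelerators = 2048, 256
avg_step_time_s = 40.0  # assumed for illustration only

achieved = model_flops * batch_size / avg_step_time_s / num_accelerators / 1e12
print(f"TFLOPS per accelerator: {achieved:.1f}")  # 440.0
print(f"MFU: {achieved / peak_tflops:.1%}")       # 44.5%
```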
# Theoretical FLOPs per forward + backward pass: 1.6e15 per sample for my unsupported model
# Theoretical peak for the hardware used: 1000 TFLOPS
python3 process_training_results.py --file examples/dllogger.json \
--batch_size 2048 \
--num_accelerators 256 \
--model_flops 1.6E15 \
--max_flops 1000
This command analyzes the examples/dllogger.json file for an unsupported model with 1.6e15 FLOPs per sample (forward + backward), trained with a batch size of 2048 on 256 accelerators, each with a theoretical peak of 1000 TFLOPS.
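Note the asymmetric units: --model_flops is given in raw FLOPs per sample, while --max_flops is in TFLOPS. With an assumed 30 s average step time, the numbers above would combine as:

```python
# --model_flops is raw FLOPs per sample; --max_flops is TFLOPS (1e12 FLOPS).
achieved = 1.6e15 * 2048 / 30.0 / 256 / 1e12  # ~426.7 TFLOPS per accelerator
print(f"MFU: {achieved / 1000:.1%}")          # ~42.7% against a 1000 TFLOPS peak
```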
The script prints the following information to the console:
- Average step time
- TFLOPS per accelerator
- MFU
| Accelerator | Max TFLOPS (bf16) |
|---|---|
| h100 | 989 |
| v5e | 197 |
| v5p | 459 |
| a100 | 312 |
| Model | FLOPs per sample |
|---|---|
| gpt3-5b | 6.69e13 |
| gpt3-175b | 2.2e15 |
| llama2-7b | 1.89e14 |
| llama2-70b | 1.82e15 |
| llama3-70b | 3.94e15 |
| mixtral-7b | 3.4e14 |