
sagemaker-cv-preprocessing-training-performance

This repository contains an Amazon SageMaker training implementation with data pre-processing (decoding + augmentations) on both GPUs and CPUs for computer vision, allowing you to compare and reduce training time by addressing CPU bottlenecks caused by increasing data pre-processing load. This is achieved by GPU-accelerated JPEG image decoding and by offloading augmentations to GPUs using NVIDIA DALI. Performance bottlenecks and system utilization metrics are compared using Amazon SageMaker Debugger.

Module Description:

  • util_train.py: Launch Amazon SageMaker PyTorch training jobs with your custom training script.
  • src/sm_augmentation_train-script.py: Custom training script to train models of different complexities (RESNET-18, RESNET-50, RESNET-152) with data pre-processing implementations for:
    • JPEG decoding and augmentation on CPUs using the PyTorch DataLoader
    • JPEG decoding and augmentation on CPUs & GPUs using NVIDIA DALI (see the sketch after this list)
  • util_debugger.py: Extract system utilization metrics with SageMaker Debugger.
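
For orientation, below is a minimal sketch of the dali-gpu pre-processing path: a DALI pipeline that decodes JPEGs and runs augmentations on the GPU, then feeds PyTorch. The operator choices, image size, normalization constants, and data path are illustrative assumptions and may differ from the actual script.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def dali_gpu_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # device="mixed": JPEG headers are parsed on the CPU, pixels are decoded on the GPU (nvJPEG)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.random_resized_crop(images, size=[224, 224])
    # Fused crop/flip/normalize also runs on the GPU, since its input already lives there
    images = fn.crop_mirror_normalize(images,
                                      dtype=types.FLOAT,
                                      output_layout="CHW",
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels.gpu()

# Batch size, thread count, and the SageMaker channel mount point are example values
pipe = dali_gpu_pipeline(data_dir='/opt/ml/input/data/training',
                         batch_size=32, num_threads=4, device_id=0)
pipe.build()
train_loader = DALIGenericIterator([pipe], ['data', 'label'], reader_name='Reader')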

Run a SageMaker training job with decoding and augmentation on GPU:

  • Parameters such as the training data path, S3 bucket, epochs, and other training hyperparameters can be adapted in util_train.py.
  • The custom training script used is src/sm_augmentation_train-script.py.
from util_debugger import get_sys_metric
from util_train import aug_exp_train

# Launch a SageMaker training job that decodes and augments JPEGs on the GPU via DALI;
# get_sys_metric can then be used to extract the job's system utilization metrics
aug_exp_train(model_arch='RESNET50',
              batch_size='32',
              aug_operator='dali-gpu',
              instance_type='ml.p3.2xlarge',
              curr_sm_role='to-be-added')  # your SageMaker execution role ARN
  • Note that this implementation is currently optimized for single-GPU training to address multi-core CPU bottlenecks. For larger multi-GPU instances, the DALI decoder operation can be tuned through its device_memory_padding and host_memory_padding arguments, as sketched below.
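
For illustration, those hints are plain keyword arguments on the DALI image decoder; the padding values below come from NVIDIA's DALI examples and are starting points to tune, not recommended settings:

# Pre-allocate nvJPEG device/host scratch buffers so large images do not trigger
# re-allocations mid-training; values are illustrative and should be tuned per instance
images = fn.decoders.image(jpegs, device="mixed",
                           device_memory_padding=211025920,
                           host_memory_padding=140544512,
                           output_type=types.RGB)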

Experiment to compare bottlenecks:

  • Create an Amazon S3 bucket called sm-aug-test and upload the Imagenette dataset (download link).
  • Update your SageMaker execution role in the notebook and run it to compare seconds/epoch and system utilization across training jobs by toggling the following parameters:
    • instance_type (default: ml.p3.2xlarge)
    • model_arch (default: RESNET18)
    • batch_size (default: 32)
    • aug_load_factor (default: 12)
    • AUGMENTATION_APPROACHES (default: ['pytorch-cpu', 'dali-gpu'])
  • Comparison results using the above default parameter setup:
    • A seconds/epoch improvement of 72.59% in the Amazon SageMaker training job by offloading JPEG decoding and heavy augmentation to the GPU, addressing the data pre-processing bottleneck and improving the performance-cost ratio.
    • With this strategy, the training time improvement is larger for lighter models such as RESNET-18 (which leave the CPU as the bottleneck) than for heavier models such as RESNET-152, especially as aug_load_factor is increased while the batch size is kept at a low 32.
    • System utilization histograms and CPU bottleneck heatmaps are generated with SageMaker Debugger in the notebook (a minimal sketch follows this list); the Profiler Report and other interactive visuals are available in SageMaker Studio.
  • Further detailed results (based on different augmentation loads, batch sizes, and model complexities for training on 8 CPUs and 1 GPU) are available on request.
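
For reference, here is a minimal sketch of pulling those system metrics with the open-source smdebug profiler utilities, which util_debugger.py builds on; the job name and region are placeholders:

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

# Attach to a SageMaker training job by name and wait for profiler data to land
tj = TrainingJob('sm-aug-job-name', 'us-east-1')  # placeholder job name and region
tj.wait_for_sys_profiling_data_to_be_available()
reader = tj.get_systems_metrics_reader()
reader.refresh_event_file_list()

# Histogram of CPU/GPU utilization over the whole job; the notebook builds its
# bottleneck heatmaps from the same system metrics reader
MetricsHistogram(reader).plot(starttime=0,
                              endtime=reader.get_timestamp_of_latest_available_file(),
                              select_dimensions=['CPU', 'GPU'],
                              select_events=['total'])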

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
