
sagemaker-cv-preprocessing-training-performance

This repository contains an Amazon SageMaker training implementation with data pre-processing (decoding + augmentations) on both GPUs and CPUs for computer vision, allowing you to compare and reduce training time by addressing CPU bottlenecks caused by increasing data pre-processing load. This is achieved by GPU-accelerated JPEG image decoding and by offloading augmentations to GPUs using NVIDIA DALI. Performance bottlenecks and system utilization metrics are compared using Amazon SageMaker Debugger.

Module Description:

  • util_train.py: Launch Amazon SageMaker PyTorch training jobs with your custom training script.
  • src/sm_augmentation_train-script.py: Custom training script to train models of different complexities (RESNET-18, RESNET-50, RESNET-152) with data pre-processing implementations for:
    • JPEG decoding and augmentation on CPUs using the PyTorch DataLoader
    • JPEG decoding and augmentation on CPUs & GPUs using NVIDIA DALI (see the sketch after this list)
  • util_debugger.py: Extract system utilization metrics with SageMaker Debugger.
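
For orientation, below is a minimal sketch of the dali-gpu pre-processing path: a DALI pipeline that decodes JPEGs and runs augmentations on the GPU, then feeds PyTorch. The operator choices, image size, normalization constants, and data path are illustrative assumptions and may differ from the actual script.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def dali_gpu_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # device="mixed": JPEG headers are parsed on the CPU, pixels are decoded on the GPU (nvJPEG)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.random_resized_crop(images, size=[224, 224])
    # Fused crop/flip/normalize also runs on the GPU, since its input already lives there
    images = fn.crop_mirror_normalize(images,
                                      dtype=types.FLOAT,
                                      output_layout="CHW",
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, labels.gpu()

# Batch size, thread count, and the SageMaker channel mount point are example values
pipe = dali_gpu_pipeline(data_dir='/opt/ml/input/data/training',
                         batch_size=32, num_threads=4, device_id=0)
pipe.build()
train_loader = DALIGenericIterator([pipe], ['data', 'label'], reader_name='Reader')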

Run a SageMaker training job with decoding and augmentation on GPU:

  • Parameters such as the training data path, S3 bucket, epochs, and other training hyperparameters can be adapted in util_train.py.
  • The custom training script used is src/sm_augmentation_train-script.py.
from util_debugger import get_sys_metric
from util_train import aug_exp_train

# Launch a SageMaker training job that decodes and augments JPEGs on the GPU via DALI;
# get_sys_metric can then be used to extract the job's system utilization metrics
aug_exp_train(model_arch='RESNET50',
              batch_size='32',
              aug_operator='dali-gpu',
              instance_type='ml.p3.2xlarge',
              curr_sm_role='to-be-added')  # your SageMaker execution role ARN
  • Note that this implementation is currently optimized for single-GPU training to address multi-core CPU bottlenecks. For larger multi-GPU instances, the DALI decoder operation can be tuned through its device_memory_padding and host_memory_padding arguments, as sketched below.
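
For illustration, those hints are plain keyword arguments on the DALI image decoder; the padding values below come from NVIDIA's DALI examples and are starting points to tune, not recommended settings:

# Pre-allocate nvJPEG device/host scratch buffers so large images do not trigger
# re-allocations mid-training; values are illustrative and should be tuned per instance
images = fn.decoders.image(jpegs, device="mixed",
                           device_memory_padding=211025920,
                           host_memory_padding=140544512,
                           output_type=types.RGB)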

Experiment to compare bottlenecks:

  • Create an Amazon S3 bucket called sm-aug-test and upload the Imagenette dataset (download link).
  • Update your SageMaker execution role in the notebook and run it to compare seconds/epoch and system utilization across training jobs by toggling the following parameters:
    • instance_type (default: ml.p3.2xlarge)
    • model_arch (default: RESNET18)
    • batch_size (default: 32)
    • aug_load_factor (default: 12)
    • AUGMENTATION_APPROACHES (default: ['pytorch-cpu', 'dali-gpu'])
  • Comparison results using the above default parameter setup:
    • A seconds/epoch improvement of 72.59% in the Amazon SageMaker training job by offloading JPEG decoding and heavy augmentation to the GPU, addressing the data pre-processing bottleneck and improving the performance-cost ratio.
    • With this strategy, the training time improvement is larger for lighter models such as RESNET-18 (which leave the CPU as the bottleneck) than for heavier models such as RESNET-152, especially as aug_load_factor is increased while the batch size is kept at a low 32.
    • System utilization histograms and CPU bottleneck heatmaps are generated with SageMaker Debugger in the notebook (a minimal sketch follows this list); the Profiler Report and other interactive visuals are available in SageMaker Studio.
  • Further detailed results (based on different augmentation loads, batch sizes, and model complexities for training on 8 CPUs and 1 GPU) are available on request.
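
For reference, here is a minimal sketch of pulling those system metrics with the open-source smdebug profiler utilities, which util_debugger.py builds on; the job name and region are placeholders:

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

# Attach to a SageMaker training job by name and wait for profiler data to land
tj = TrainingJob('sm-aug-job-name', 'us-east-1')  # placeholder job name and region
tj.wait_for_sys_profiling_data_to_be_available()
reader = tj.get_systems_metrics_reader()
reader.refresh_event_file_list()

# Histogram of CPU/GPU utilization over the whole job; the notebook builds its
# bottleneck heatmaps from the same system metrics reader
MetricsHistogram(reader).plot(starttime=0,
                              endtime=reader.get_timestamp_of_latest_available_file(),
                              select_dimensions=['CPU', 'GPU'],
                              select_events=['total'])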

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
