Handling Data Sharding for Snowflake Input in SageMaker PyTorchProcessor #5010
PallaviJagarlamudi asked this question in Help
Issue:
I am using PyTorchProcessor in Amazon SageMaker to run inference. Previously, our data was stored in Amazon S3, and SageMaker automatically handled data sharding across instances. However, we have migrated the input data source to Snowflake, which contains 500 million rows (~2.1TB of data).
Since SageMaker ProcessingInput does not support a Snowflake table as a data source, I need to manually partition the data for efficient processing across multiple instances in a SageMaker Processing step. The current job setup is sketched below.
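For context, this is roughly how the job is launched today. The framework version, role ARN, instance sizes, and table name below are illustrative placeholders, not our real values; note that there is no ProcessingInput, so each instance has to pull its own slice from Snowflake inside the processing script:

```python
# Rough sketch of the current launch code (all identifiers are placeholders).
from sagemaker.pytorch.processing import PyTorchProcessor

processor = PyTorchProcessor(
    framework_version="2.1",
    py_version="py310",
    role="arn:aws:iam::123456789012:role/SageMakerProcessingRole",
    instance_type="ml.m5.4xlarge",
    instance_count=8,
    base_job_name="snowflake-batch-inference",
)

processor.run(
    code="inference.py",   # processing script, sketched under Challenges below
    source_dir="src",      # contains inference.py + requirements.txt (snowflake-connector-python)
    arguments=["--table", "ANALYTICS.PUBLIC.EVENTS"],  # hypothetical table name
)
```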
Challenges
1. How to manually handle data partitioning/sharding for a large Snowflake table in SageMaker?
2. How to determine the number of instances and the current instance ID during a Processing step?
3. Can we leverage resourceconfig.json or environment variables (similar to SM_HOSTS in training) to distribute partitions across instances?
4. What is the best approach for mod-based partitioning (MOD(HASH(id), num_instances) = current_instance_id) in this setup? A rough sketch combining points 2–4 is included after this list.
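Here is a minimal sketch of what I have in mind for the processing script. It assumes (and this is exactly what I am unsure about) that Processing jobs populate /opt/ml/config/resourceconfig.json with "current_host" and "hosts" the same way training jobs expose SM_HOSTS, and that the table has a stable id column to hash on. Connection parameters and the table name are placeholders:

```python
# inference.py -- sketch of per-instance sharding inside the processing container.
# Assumptions: /opt/ml/config/resourceconfig.json exists with "current_host" and
# "hosts" (analogous to SM_HOSTS in training); the table has a stable `id` column.
import json

import snowflake.connector  # installed via requirements.txt in source_dir

RESOURCE_CONFIG = "/opt/ml/config/resourceconfig.json"


def get_shard_info():
    """Return (instance_id, num_instances) derived from the resource config."""
    with open(RESOURCE_CONFIG) as f:
        cfg = json.load(f)
    hosts = sorted(cfg["hosts"])                    # e.g. ["algo-1", "algo-2", ...]
    instance_id = hosts.index(cfg["current_host"])  # 0-based index of this instance
    return instance_id, len(hosts)


def fetch_partition(instance_id, num_instances, table="ANALYTICS.PUBLIC.EVENTS"):
    """Yield only the rows assigned to this instance via mod-based hash partitioning."""
    conn = snowflake.connector.connect(
        user="...", password="...", account="...",   # fill from Secrets Manager / env vars
        warehouse="...", database="...", schema="...",
    )
    query = f"""
        SELECT *
        FROM {table}
        WHERE MOD(ABS(HASH(id)), %s) = %s
    """
    cur = conn.cursor()
    cur.execute(query, (num_instances, instance_id))
    # Stream in batches rather than fetch_pandas_all() to keep memory bounded.
    for batch in cur.fetch_pandas_batches():
        yield batch


if __name__ == "__main__":
    instance_id, num_instances = get_shard_info()
    for df in fetch_partition(instance_id, num_instances):
        pass  # run model inference on df here
```

Does this look like a reasonable way to shard a 2.1 TB Snowflake table across Processing instances, or is there a recommended pattern for this?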