Handling Data Sharding for Snowflake Input in SageMaker PyTorchProcessor #5010
PallaviJagarlamudi asked this question in Help
Issue:
I am using PyTorchProcessor in Amazon SageMaker to run inference. Previously, our data was stored in Amazon S3, and SageMaker automatically handled data sharding across instances. However, we have migrated the input data source to Snowflake, which contains 500 million rows (~2.1TB of data).
Since SageMaker ProcessingInput does not support a Snowflake table as a data source, I need to manually partition the data for efficient processing across multiple instances in a SageMaker Processing step. The current job setup is sketched below.
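For context, this is roughly how the job is launched today. The framework version, role ARN, instance sizes, and table name below are illustrative placeholders, not our real values; note that there is no ProcessingInput, so each instance has to pull its own slice from Snowflake inside the processing script:

```python
# Rough sketch of the current launch code (all identifiers are placeholders).
from sagemaker.pytorch.processing import PyTorchProcessor

processor = PyTorchProcessor(
    framework_version="2.1",
    py_version="py310",
    role="arn:aws:iam::123456789012:role/SageMakerProcessingRole",
    instance_type="ml.m5.4xlarge",
    instance_count=8,
    base_job_name="snowflake-batch-inference",
)

processor.run(
    code="inference.py",   # processing script, sketched under Challenges below
    source_dir="src",      # contains inference.py + requirements.txt (snowflake-connector-python)
    arguments=["--table", "ANALYTICS.PUBLIC.EVENTS"],  # hypothetical table name
)
```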
Challenges
1. How to manually handle data partitioning/sharding for a large Snowflake table in SageMaker?
2. How to determine the number of instances and the current instance ID during a Processing step?
3. Can we leverage resourceconfig.json or environment variables (similar to SM_HOSTS in training) to distribute partitions across instances?
4. What is the best approach for mod-based partitioning (MOD(HASH(id), num_instances) = current_instance_id) in this setup? A rough sketch combining points 2–4 is included after this list.
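Here is a minimal sketch of what I have in mind for the processing script. It assumes (and this is exactly what I am unsure about) that Processing jobs populate /opt/ml/config/resourceconfig.json with "current_host" and "hosts" the same way training jobs expose SM_HOSTS, and that the table has a stable id column to hash on. Connection parameters and the table name are placeholders:

```python
# inference.py -- sketch of per-instance sharding inside the processing container.
# Assumptions: /opt/ml/config/resourceconfig.json exists with "current_host" and
# "hosts" (analogous to SM_HOSTS in training); the table has a stable `id` column.
import json

import snowflake.connector  # installed via requirements.txt in source_dir

RESOURCE_CONFIG = "/opt/ml/config/resourceconfig.json"


def get_shard_info():
    """Return (instance_id, num_instances) derived from the resource config."""
    with open(RESOURCE_CONFIG) as f:
        cfg = json.load(f)
    hosts = sorted(cfg["hosts"])                    # e.g. ["algo-1", "algo-2", ...]
    instance_id = hosts.index(cfg["current_host"])  # 0-based index of this instance
    return instance_id, len(hosts)


def fetch_partition(instance_id, num_instances, table="ANALYTICS.PUBLIC.EVENTS"):
    """Yield only the rows assigned to this instance via mod-based hash partitioning."""
    conn = snowflake.connector.connect(
        user="...", password="...", account="...",   # fill from Secrets Manager / env vars
        warehouse="...", database="...", schema="...",
    )
    query = f"""
        SELECT *
        FROM {table}
        WHERE MOD(ABS(HASH(id)), %s) = %s
    """
    cur = conn.cursor()
    cur.execute(query, (num_instances, instance_id))
    # Stream in batches rather than fetch_pandas_all() to keep memory bounded.
    for batch in cur.fetch_pandas_batches():
        yield batch


if __name__ == "__main__":
    instance_id, num_instances = get_shard_info()
    for df in fetch_partition(instance_id, num_instances):
        pass  # run model inference on df here
```

Does this look like a reasonable way to shard a 2.1 TB Snowflake table across Processing instances, or is there a recommended pattern for this?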