Run Spark applications in AWS EMR
You can download the executables from the releases page or by using the following scripts:
macOS:

curl -sSL https://github.com/weixu365/aws-emr-runner/releases/latest/download/aws-emr-runner-macos.bz2 | \
  bunzip2 > aws-emr-runner && \
  chmod +x aws-emr-runner
Linux:

curl -sSL https://github.com/weixu365/aws-emr-runner/releases/latest/download/aws-emr-runner-linux.bz2 | \
  bunzip2 > aws-emr-runner && \
  chmod +x aws-emr-runner
Validate the config file:

./aws-emr-runner validate -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml

Create or update the resources stack:

./aws-emr-runner resources -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml

Run the pipeline:

./aws-emr-runner run -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml

Start a cluster:

./aws-emr-runner start-cluster -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml

Terminate the cluster:

./aws-emr-runner terminate-cluster -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml
The resources stack can contain:
- EMR service role. You can use either the default EMR role EMR_DefaultRole (created by aws emr create-default-roles) or create a custom role in the resources stack
- S3 bucket. Package files are uploaded to this bucket, and the EMR steps then run using the uploaded package
- IAM role and instance profile for the EMR instances
- (Optional) Lambda function, with the same name, to clean up idle clusters
- Any other resource you want to put in the resources stack (a sketch of a custom resource follows this list)
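For illustration, a custom resource is declared in the config's resources block using CloudFormation syntax (the same form as the security-configuration example later in this document); the logical name and bucket name below are made up:

```yaml
resources:
  # Hypothetical S3 bucket for uploaded packages; rename to suit your project
  PackageBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: sample-spark-pipeline-packages
```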
When running a pipeline, aws-emr-runner will:
- Load the settings file
- Load the config file and evaluate all variables except resource variables
- Create or update the resources stack
- Get resources from the resources stack
- Evaluate all variables in the config file
- Generate EMR steps from the config file
- Create an EMR cluster with the defined steps
- Wait until all steps are completed
Sample config files:
- samples/enrichment-pipeline.yml
- samples/enrichment-pipeline.settings.yml
The following variables can be used in the config file:
- Environment variables, e.g. {{env.BUILD_NUMBER}}
- Values in a settings file, referenced through the Values prefix, e.g. {{Values.environment}}
- Resources in the resources stack, e.g. {{Resources.EMRInstanceProfile.PhysicalResourceId}}
- Predefined variables:
  - {{EmrHadoopDebuggingStep}}: enables debugging in EMR
  - {{AWSAccountID}}: the current AWS account ID
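As a rough sketch of how these variables combine (the environment value, log bucket, and cluster name are invented for this example; see the sample configs for real usage):

```yaml
cluster:
  # env.BUILD_NUMBER comes from the environment, Values.environment from the settings file
  Name: 'enrichment-pipeline-{{Values.environment}}-build{{env.BUILD_NUMBER}}'
  # AWSAccountID is a predefined variable; the bucket name is a placeholder
  LogUri: 's3://my-emr-logs-{{AWSAccountID}}/enrichment-pipeline/'
  # PhysicalResourceId of a resource created in the resources stack
  JobFlowRole: '{{Resources.EMRInstanceProfile.PhysicalResourceId}}'
```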
All of the configuration options of the AWS Node.js SDK new EMR().runJobFlow() method are supported; see
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/EMR.html#runJobFlow-property
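For example, a minimal cluster block might look like the sketch below. Every key is a standard runJobFlow() parameter; the release label, instance types, jar location, and exact placement of {{EmrHadoopDebuggingStep}} are assumptions for illustration:

```yaml
cluster:
  Name: sample-spark-pipeline
  ReleaseLabel: emr-6.3.0                 # EMR release to launch (placeholder)
  Applications:
    - Name: Spark
  Instances:
    InstanceGroups:
      - Name: Master
        InstanceRole: MASTER
        InstanceType: m5.xlarge
        InstanceCount: 1
      - Name: Core
        InstanceRole: CORE
        InstanceType: m5.xlarge
        InstanceCount: 2
    KeepJobFlowAliveWhenNoSteps: false    # terminate once all steps finish
  JobFlowRole: '{{Resources.EMRInstanceProfile.PhysicalResourceId}}'
  ServiceRole: EMR_DefaultRole
  Steps:
    - '{{EmrHadoopDebuggingStep}}'        # predefined variable that enables debugging
    - Name: run-spark-job
      ActionOnFailure: TERMINATE_CLUSTER
      HadoopJarStep:
        Jar: command-runner.jar
        Args: ['spark-submit', '--deploy-mode', 'cluster', 's3://your-bucket/your-app.jar']
```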
You can run any command or JavaScript file at any of the life cycle events, e.g. package your Spark application at the package event:
scripts:
  package:
    - make docker-package
or, if there is only a single command:

scripts:
  package: make docker-package
or run aws emr create-default-roles before deploying the resources stack:

scripts:
  beforeDeployResources:
    - aws emr create-default-roles
Here are all supported life cycle events, in order (a combined example follows these lists).

For the run and run-step commands:
- beforeDeployResources
- afterDeployResources
- beforeLoadResources
- afterLoadResources
- beforePackage
- package
- afterPackage
- beforeUploadPackage
- afterUploadPackage
- beforeRun (only available in the run command)
- afterRun (only available in the run command)
- afterComplete (only available in the run command)
- beforeSubmit (only available in the run-step command)
- afterSubmit (only available in the run-step command)
- afterStepComplete (only available in the run-step command)
For the start-cluster command:
- beforeDeployResources
- afterDeployResources
- beforeLoadResources
- afterLoadResources
- beforeStartCluster
- beforeWaitForClusterStarted
- afterClusterStarted
For the terminate-cluster command:
- beforeTerminateCluster
- beforeWaitForClusterTerminated
- afterClusterTerminated
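For illustration, several hooks can be combined in one scripts block; the make target and the notification script below are placeholders, not something the tool ships with:

```yaml
scripts:
  beforeDeployResources:
    - aws emr create-default-roles   # make sure the default EMR roles exist
  package:
    - make docker-package            # build the Spark application package
  afterComplete:                     # run command only: fires after all steps complete
    - node scripts/notify.js         # hypothetical notification script
```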
By setting maxIdleMinutes in the config file, aws-emr-runner will set up a scheduled task to check for idle clusters and terminate the cluster once it has been idle for longer than maxIdleMinutes, e.g.
deploy:
  ...
  maxIdleMinutes: 30 # Automatically terminate the cluster if it exceeds the maximum idle time
You can assume different roles based on predefined rules using AWS::EMR::SecurityConfiguration, e.g.
cluster:
  ...
  SecurityConfiguration: '{{Resources.EMRSecurityConfiguration.PhysicalResourceId}}'

resources:
  EMRSecurityConfiguration:
    Type: AWS::EMR::SecurityConfiguration
    Properties:
      Name: sample-spark-pipeline-emr-securityconfiguration
      SecurityConfiguration:
        AuthorizationConfiguration:
          EmrFsConfiguration:
            RoleMappings:
              - Role: "arn:aws:iam::<account-id>:role/<role>"
                IdentifierType: Prefix
                Identifiers:
                  - "s3://your-bucket/"
- validate
  - validate config file with settings
    BUILD_NUMBER=123 bin/aws-emr-runner-macos validate -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml
  - validate config file without settings
    BUILD_NUMBER=123 bin/aws-emr-runner-macos validate -f samples/enrichment-pipeline.yml
- resources
  - create resource stack
    BUILD_NUMBER=123 bin/aws-emr-runner-macos resources -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml
  - delete resource stack
    BUILD_NUMBER=123 bin/aws-emr-runner-macos delete-resources -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml
- cluster
  - start cluster
    BUILD_NUMBER=123 bin/aws-emr-runner-macos start-cluster -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml
  - delete cluster
    BUILD_NUMBER=123 bin/aws-emr-runner-macos terminate-cluster -f samples/enrichment-pipeline.yml -s samples/enrichment-pipeline.settings.yml --cluster-id <id>
  - auto delete cluster
- run-step
  - run spark step on cluster
- run
  - wait for steps to complete successfully
  - wait for cancelled steps and fail
  - auto delete the resource stack if it cannot be updated, e.g. Failed to create cloudformation changeset for 'sample-spark-application-resources-prod', caused by ValidationError: Stack:arn:aws:cloudformation:ap-southeast-2:807579936170:stack/sample-spark-application-resources-prod/91895dd0-03a5-11ee-a3e7-02dee24aa130 is in ROLLBACK_COMPLETE state and can not be updated.