Elastic MapReduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables you to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
LDIF can now be run on Amazon EMR.
Instructions and tools can be found on the Amazon S3 web page.
All LDIF configuration documents need to be publicly accessible, so that the EMR job flow can retrieve them.
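For example, assuming the configuration files are stored in an S3 bucket you control (the bucket and file names below are placeholders), boto can upload a document with a public-read ACL:

import boto

# Connect to S3 and open the bucket holding the LDIF configuration
conn = boto.connect_s3()
bucket = conn.get_bucket('your-bucket')

# Upload the scheduler configuration with a public-read ACL so the
# EMR job flow can fetch it over plain HTTP
key = bucket.new_key('your-schedulerConfig.xml')
key.set_contents_from_filename('your-schedulerConfig.xml', policy='public-read')
# Now reachable at http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml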
A job flow can be created and managed in different ways; the AWS Management Console and the boto library are described below.
- Open the AWS Management Console (an AWS account is required)
- Create a new Custom JAR job flow
- Set ldif-hadoop/ldif-hadoop-exe-0.5-jar-with-dependencies.jar as JAR Location
- Set scheduler http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml as JAR Arguments
- Configure the cluster using the wizard
The following screenshots show all the steps to create and run an LDIF job flow using the AWS console.
boto is an integrated interface to infrastructural services offered by Amazon Web Services.
- Install and configure boto (Python is required)
- Execute the following code:
import boto
from boto.emr.step import JarStep

# Init a connection to the EMR service
conn = boto.connect_emr()

# Create an EMR step that executes the LDIF jar against your configuration
step = JarStep(name='ldif-step',
               jar='s3n://ldif-hadoop/ldif-hadoop-exe-0.5-jar-with-dependencies.jar',
               step_args=['scheduler', 'http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml'])

# Create and run your custom EMR job flow; run_jobflow returns its id
jobflow_id = conn.run_jobflow(name='ldif-flow',
                              num_instances=3,
                              master_instance_type='c1.medium',
                              slave_instance_type='c1.medium',
                              log_uri='s3n://your-bucket/your-logs-path/',
                              hadoop_version='0.20',
                              steps=[step])
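The returned job flow id can be used to monitor progress from the same connection. A minimal polling sketch (the 30-second interval and the set of terminal states shown are illustrative):

import time

# Poll the job flow until it reaches a terminal state
while True:
    status = conn.describe_jobflow(jobflow_id)
    print(status.state)
    if status.state in ('COMPLETED', 'FAILED', 'TERMINATED'):
        break
    time.sleep(30)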
See also the boto.emr documentation.
See the Job Flow documentation on the AWS site.