
Introduction

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables you to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.

LDIF can now be run on Amazon EMR.

Steps to run LDIF on EMR

1. Upload your LDIF configuration documents to S3

Instructions and tools can be found on the Amazon S3 web page.
All LDIF configuration documents need to be publicly readable, since the job flow fetches them via their public HTTP URLs.
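
As a minimal sketch of this step using boto (the Python AWS library introduced in section 2.2; the bucket and file names are placeholders), the upload and public-read grant could look like this:

import boto

# Connect to S3 (credentials are read from the boto configuration or environment)
conn = boto.connect_s3()

# Assumes an existing bucket named 'your-bucket'
bucket = conn.get_bucket('your-bucket')

# Upload the scheduler configuration and grant public read access,
# so that EMR can fetch it via its public URL
key = bucket.new_key('your-schedulerConfig.xml')
key.set_contents_from_filename('your-schedulerConfig.xml')
key.make_public()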

2. Create an LDIF job flow

A job flow can be created and managed in several ways:

2.1 Using the AWS Management Console
  1. Open the AWS Management Console (an AWS account is required)
  2. Create a new Custom JAR job flow:
    1. set ldif-hadoop/ldif-hadoop-exe-0.5-jar-with-dependencies.jar as the JAR Location
    2. set scheduler http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml as the JAR Arguments (the first argument tells LDIF to run the scheduler; the second is the public URL of your scheduler configuration)
    3. configure the cluster using the wizard

[Screenshots: the steps to create and run an LDIF job flow using the AWS console]
2.2 Using boto

boto is an integrated Python interface to infrastructural services offered by Amazon Web Services.

  1. Install and configure boto (Python is required)
  2. Execute the following code:
import boto
from boto.emr.step import JarStep

# Initialize a connection to the EMR service
conn = boto.connect_emr()

# Create an EMR step that executes the LDIF jar against your configuration
step = JarStep(name='ldif-step',
    jar='s3n://ldif-hadoop/ldif-hadoop-exe-0.5-jar-with-dependencies.jar',
    step_args=['scheduler', 'http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml'])

# Create and run your custom EMR job flow; run_jobflow returns the job flow id
jobflow_id = conn.run_jobflow(name='ldif-flow',
    num_instances=3,
    master_instance_type='c1.medium',
    slave_instance_type='c1.medium',
    log_uri='s3n://your-bucket/your-logs-path/',
    hadoop_version='0.20',
    steps=[step])

See also the boto.emr documentation.
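
A possible follow-up, using describe_jobflow from the same module, is to poll the job flow until it reaches a terminal state (jobflow_id is the value returned by run_jobflow above):

import time

# Poll the job flow state every 30 seconds until it finishes
while True:
    state = conn.describe_jobflow(jobflow_id).state
    print(state)
    if state in ('COMPLETED', 'FAILED', 'TERMINATED'):
        break
    time.sleep(30)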

2.3 Using the Amazon EMR command line interface

See the Job Flow documentation on the AWS site.
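
As a rough sketch (assuming the elastic-mapreduce Ruby client, the standard EMR CLI at the time, is installed and configured with your credentials; bucket and path names are the same placeholders used above, and the flag names should be checked against the CLI documentation), the job flow from section 2.2 might be created with a single command:

elastic-mapreduce --create --name "ldif-flow" \
  --jar s3n://ldif-hadoop/ldif-hadoop-exe-0.5-jar-with-dependencies.jar \
  --arg scheduler \
  --arg http://s3.amazonaws.com/your-bucket/your-schedulerConfig.xml \
  --num-instances 3 \
  --instance-type c1.medium \
  --log-uri s3n://your-bucket/your-logs-path/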