This assignment steps you through running a simple computation over a data set using Map/Reduce via mrjob. The goal is to have you walk through the process of using GitHub, Python, mrjob, and AWS, and to ensure you are set up with all the various tools and services.
- Getting started with Amazon AWS video tutorials
- Introduction to AWS training
- A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others
- A Survey on Cloud Provider Security Measures
Note: Keep track of the time necessary to run the process. For Linux/Mac users, you can use the time command to measure this.
- Follow the instructions for running the program locally and measure the completion time.
- Follow the process for running the program on Amazon Elastic MapReduce and measure the completion time.
- Download the output from S3.
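If the time command is not available (for example on Windows), the completion time can also be measured from Python. The following is a minimal sketch, not part of the assignment code: the helper name time_command is hypothetical, and the command it runs here is only a placeholder that you would replace with the actual invocation from the assignment instructions.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Placeholder command (an assumption for illustration): replace this list
    # with the real command line used to run the job locally or on EMR.
    code, seconds = time_command([sys.executable, "-c", "print('hello')"])
    print(f"exit code {code}, took {seconds:.2f} s")
```

This measures wall-clock time only, which is the figure you would normally report when comparing the local run with the EMR run.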
You can create users instead of using your root AWS credentials. If you do not have a user/group with access to EMR, you'll need to follow the procedure below.
First, you need to set up a user to run EMR:
- Visit http://aws.amazon.com/ and sign up for an account.
- Select the "Identity and Access Management" (or IAM) from your console or visit https://console.aws.amazon.com/iam/home
- Select "Users" from the list on the left.
- Click on "Create New Users".
- Enter a user name for yourself and create the user.
- The next screen will give you an option to download the credentials for this user. Do so and store them in a safe place. You will not be able to retrieve them again.
Second, you need to create a group with the right roles:
- Select "Groups" from the list on the left.
- Click on "Create New Group".
- Enter a name and click on "Next Step".
- Scroll down to "Amazon Elastic MapReduce Full Access" and click on "Select".
- Once the policy document is displayed, click on "Next Step".
- Click on "Create Group" to create the group.
Third, you need to assign your user to the group:
- Select the check box next to your group.
- Click on the "Group Actions" drop-down menu and click on "Add Users to Group".
- Select your user by clicking on the check box.
- Click on "Add Users".
You need to configure mrjob to access your AWS account:
- Edit the mrjob.conf file.
- Locate the #aws_access_key_id: and #aws_secret_access_key: lines.
- Remove the hash (#) and add your AWS key and secret after the colon (:). You should have these from previously creating the user.
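A quick way to confirm that both credential lines were uncommented and filled in is a small text check like the one below. This is a sketch under assumptions: the function name missing_credentials is hypothetical, the sample configuration layout and the EXAMPLE... values are made up for illustration, and the check only looks at the two key names mentioned above rather than parsing the whole file.

```python
import re

# The two settings named in the steps above.
KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credentials(conf_text):
    """Return the credential keys that are absent, commented out, or empty."""
    missing = []
    for key in KEYS:
        # An active line starts with optional indentation, then the key,
        # a colon, and a value that is not itself a comment.
        pattern = rf"^[ \t]*{key}:[ \t]*[^\s#]"
        if not re.search(pattern, conf_text, flags=re.MULTILINE):
            missing.append(key)
    return missing


if __name__ == "__main__":
    # Illustrative sample only; the values are fake and the secret key line
    # is still commented out, so it is reported as missing.
    sample = (
        "runners:\n"
        "  emr:\n"
        "    aws_access_key_id: EXAMPLEKEYID\n"
        "    #aws_secret_access_key:\n"
    )
    print(missing_credentials(sample))
```

To check your own file, read its contents and pass the text to the function; an empty list means both lines are active. Avoid printing the file itself, since it contains your secret key.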
You need to create an output bucket on S3 for the results of your computation:
- Go to https://aws.amazon.com/ in your browser.
- Click on the 'S3' service link.
- Click on the 'Create Bucket' button.
- Enter a name and hit create.
Keep in mind that bucket names must be unique across all of Amazon S3, not just within your own account. If you use a common name, it is likely to clash with other users' buckets. One suggestion is to use a common prefix (e.g. a domain name) for all your bucket names.
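The prefix suggestion can be sketched as a small helper that builds a name and checks it against the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The function name make_bucket_name and the example prefix are hypothetical, and the check covers only this subset of the rules. It cannot tell you whether the name is already taken.

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, dots and hyphens, beginning and ending with a letter or digit.
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def make_bucket_name(prefix, suffix):
    """Join a personal prefix and a suffix into a candidate bucket name."""
    name = f"{prefix}-{suffix}".lower()
    if not _BUCKET_PATTERN.fullmatch(name) or ".." in name:
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


if __name__ == "__main__":
    # "example.com" stands in for a domain name you control.
    print(make_bucket_name("example.com", "mrjob-out"))
```

Using the same prefix for every bucket keeps your names grouped together and makes a clash with another user much less likely.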
You must turn in a pull request containing the following:
- A copy of the output directory for the tag counter running locally (name the directory 'out').
- A copy of the output from S3 for the tag counter running on AWS (name the directory 'emr-out').
- How long did it take to run the process for each of these?
- How many address tags are there in the input?
- Do the local version and the EMR version give the same answer?
Please submit the answers to items 3-5 in a text file called answers.txt.
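To answer whether the local and EMR runs agree, the two output directories can be compared line by line. The sketch below assumes that each directory contains output files named part-* (the usual naming for Map/Reduce output, but check what your runs actually produced) and that the order of lines does not matter. The function names read_output and outputs_match are hypothetical.

```python
from pathlib import Path


def read_output(directory):
    """Return the sorted, non-empty lines from every part-* file in a directory."""
    lines = []
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle if line.strip())
    return sorted(lines)


def outputs_match(local_dir="out", emr_dir="emr-out"):
    """Compare two output directories, ignoring how lines are split across files."""
    return read_output(local_dir) == read_output(emr_dir)


if __name__ == "__main__":
    # Directory names follow the ones requested for the pull request.
    print("same answer" if outputs_match() else "outputs differ")
```

Sorting before comparing matters because the EMR run may split the results across a different number of part files than the local run.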