layout |
---|
page |
EMR stands for Elastic MapReduce. We use EMR to run Hadoop jobs on large datasets. EMR spins up a cluster of Amazon instances that executes the job, then terminates. It is a cheap way to work with big data.
If you can do the steps below, you can also try it from the browser. The nice part about CLI is you can do interactive jobs more easily which means you don't have to set up local Hadoop to test
It is actually pretty simple to set up. Follow instructions here. The following tips are meant to supplement the instructions so follow the amazon instructions and go here if you get stuck.
-
The EMR CLI requires Ruby 1.8.7. This is an old version (current is above 2.0). For easy installation and management of these different versions, you can use Ruby Version Manager. See http://rvm.io/rvm/install
curl -sSL https://get.rvm.io | bash -s stable
installs RVM. The following line must also be in your .bashrc to activate rvmsource ~/.rvm/scripts/rvm
To install / use ruby 1.8.7 runrvm install 1.8.7-p374 && rvm use 1.8.7-p374@emr-cli --create
Thenruby -v
should be 1.8.7-p374 -
You must fill in
credentials.json
for requires several fields.
- access_id / private_key are in a credentials.csv file. Either you or Patrick have it. Ask if you can't find it.
- key-pair / key-pair-file: create a key-pair file via http://docs.aws.amazon.com/IAM/latest/UserGuide/ManagingUserCerts.html (note: you probably already have openssl if you have linux). You must '''give the pem file to Patrick''' so we can associate it with your user account.
- Use "log_uri": "s3://gt-big-data-club/logs",
- Use "region": "us-east-1"
A good way to test out if EMR will work for you is to do this tutorial with Google NGrams