Cluster
Much on this page is out of date. Use the Harvest Workflow document instead to understand how harvesting is accomplished.
Gulo can be run locally on small data sets using the REPL, or on a Hadoop cluster for big data. This wiki page describes how to run it manually on Amazon Elastic MapReduce. Down the road we'll ride on Pallet for automated provisioning and deployment.
The workflow is:
- Harvest Darwin Core Archives locally into a single CSV file and upload to S3
- Compile Gulo into a standalone JAR and upload to S3
- Create and run a MapReduce job using Amazon AWS console page
- Download MapReduce outputs locally and upload to CartoDB
The following buckets and folders are required on S3 (there's a sketch for creating them just below this list):
- guloharvest
- gulohfs/
  - occ
  - tax
  - loc
  - taxloc
- gulojar
- gulologs
- gulotables
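If you're starting from scratch, s3cmd can create the buckets up front. This is just a quick sketch; the folders under gulohfs are key prefixes, so they get created automatically when the MapReduce job writes its output:
$ s3cmd mb s3://guloharvest
$ s3cmd mb s3://gulohfs
$ s3cmd mb s3://gulojar
$ s3cmd mb s3://gulologs
$ s3cmd mb s3://gulotables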
First let's harvest all Darwin Core Archives listed in the publishers table on CartoDB into a single CSV file and then upload it to Amazon S3.
Fire up your Clojure REPL:
$ lein repl
Then use these commands to harvest:
user=> (use 'gulo.main)
user=> (use 'gulo.harvest)
user=> (Harvest (publishers) "/mnt/hgfs/Data/vertnet/gulo/harvest")
When that's done, you'll have all the records from all the Darwin Core Archives in a single CSV file at /mnt/hgfs/Data/vertnet/gulo/harvest/dwc.csv. Let's upload that to the guloharvest bucket on S3 using s3cmd:
$ s3cmd put /mnt/hgfs/Data/vertnet/gulo/harvest/dwc.csv s3://guloharvest/dwc.csv
Next we need to compile Gulo and upload the resulting JAR to S3. Make sure you have lein installed and then:
$ lein do clean, deps, uberjar
That compiles Gulo into a standalone JAR in the target/ directory. Let's upload it to S3 using s3cmd:
$ s3cmd put target/gulo-0.1.0-SNAPSHOT-standalone.jar s3://gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar
Now we have the data and JAR uploaded to S3, so let's create a MapReduce job. Go to the Elastic MapReduce console and click the Create New Job Flow button.
The first step is Define Job Flow, where in the Create a Job Flow menu you select Custom JAR and then click Continue.
The second step is Specify Parameters, where you set JAR Location to gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar and JAR Arguments to gulo.main.Shred s3n://guloharvest s3n://gulohfs s3n://gulotables, and then click Continue.
The third step is Configure EC2 Instances, where you can keep the defaults and then click Continue.
The fourth step is Advanced Options, where you can keep the defaults except for Amazon S3 Log Path (Optional), which you set to s3n://gulologs/shred, and then click Continue.
The fifth step is Bootstrap Options, where you can keep the defaults and then click Continue.
The last step is Review, where you click the Create Job Flow button, which fires off the MapReduce job.
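If you'd rather skip the console clicking, the same job flow can be kicked off with Amazon's elastic-mapreduce command line client. This is just a sketch, assuming you have the client installed and configured with your AWS credentials; the flags mirror the console settings above:
$ # one --arg per JAR argument; logs go to the same place as the console setup
$ elastic-mapreduce --create --name "gulo shred" \
    --jar s3n://gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar \
    --arg gulo.main.Shred --arg s3n://guloharvest --arg s3n://gulohfs --arg s3n://gulotables \
    --log-uri s3n://gulologs/shred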
You can monitor the status of the cluster using the Elastic MapReduce console. To monitor the status of the MapReduce job, click the cluster, and in the Description tab copy the Master Public DNS Name into a browser window and append port :9100.
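If you have the elastic-mapreduce command line client installed, you can also poll the job flow from a shell instead of the browser (a quick sketch):
$ elastic-mapreduce --list --active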
When the cluster finishes, you can download the results, which are located in the occ, tax, loc, and taxloc folders in the gulohfs bucket. Again, just use s3cmd.
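For example, something like this pulls everything down in one shot (the local path here is just a hypothetical destination; point it wherever you like):
$ s3cmd get --recursive s3://gulohfs/ /mnt/hgfs/Data/vertnet/gulo/hfs/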
Finally, fire up your REPL again to prepare CartoDB tables for upload and wire them up after they get uploaded:
$ lein repl
In the REPL use these commands to prepare the table:
user=> (use 'gulo.main)
user=> (PrepareTables)
That will zip up the tables into the /mnt/hgfs/Data/vertnet/gulo/tables directory. You can upload each ZIP file directly to CartoDB using the dashboard. When they are all uploaded, make the tables public, and then back in the REPL let's wire them up (build indexes, etc.):
user=> (use 'gulo.main)
user=> (WireTables)
BOOM. We're done!