VertNet data architecture HOWTO

Much of this architecture page is obsolete. Use the Harvest workflow document instead to understand how harvesting is accomplished.

Good to have installed/configured

  • ec2 command line tools: download and install from http://aws.amazon.com/developertools/351
  • echo "export AWS_ACCESS_KEY=<vertnet key>" >> ~/.bashrc - make sure to include the vertnet key
  • echo "export AWS_SECRET_KEY=<vertnet secret key>" >> ~/.bashrc - make sure to include the secret key

Launching EC2 instance with VertNet AMI

Assuming you've got the ec2 tools installed, you can launch the harvest/bulkload instance in one go:

ec2-run-instances ami-5cebac35 -n 1 -k vertnet --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY

Here's that translated into words:

ec2-run-instances - command line tool for launching an instance
ami-5cebac35 - the harvest/bulkload AMI, all configured and ready to go
-n 1 - number of instances
-k vertnet - keypair name - corresponds to the file vertnet.pem that you'll need later to log in
--instance-type m3.xlarge - instance type chosen for optimized bulkloading
-O $AWS_ACCESS_KEY - VertNet access key
-W $AWS_SECRET_KEY - Vertnet secret key

You can also launch it from the EC2 admin console. For any options not specified here, just use the defaults.

To get the DNS address of your instance, run ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY.
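
If you just want the public DNS name, something like this works (field positions can vary between tool versions, so treat it as a sketch):

ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY | grep ^INSTANCE | cut -f4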

Log in using something like this: ssh -i ~/.ssh/vertnet.pem ubuntu@<instance public DNS>.

Rebuilding the VertNet AMI

If you want to rebuild the VertNet AMI from scratch, launch a generic Ubuntu 12.04 machine:

ec2-run-instances <ubuntu-12.04-ami-id> -n 1 -k vertnet --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY

The only difference from the command above is the AMI id: use the id of a stock Ubuntu 12.04 AMI rather than the VertNet AMI. Once it's running, use the ec2-bootstrap script to configure the machine, including installing Java and a few handy utility programs, cloning project files, etc. You must be present for the end of the script so that you can supply the required AWS and CartoDB credentials.
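
Once the machine is configured, you can bake it into a new AMI with ec2-create-image (the instance id and image name below are placeholders; get the id from ec2-describe-instances):

ec2-create-image <instance-id> --name "vertnet-harvest-bulkload" -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY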

Harvesting

Once you have your instance running, be sure to update Gulo to the latest version (from the Gulo project directory): git pull origin develop.

Then double-check that the EBS volume is mounted at /mnt/beast and is owned by ubuntu. If it isn't set up yet, create, format, and mount it:

sudo mkdir /mnt/beast                  # create the mount point
sudo mkfs -t ext3 /dev/xvdb            # format the volume (skip this if it already holds data!)
sudo mount /dev/xvdb /mnt/beast
sudo chown ubuntu:ubuntu /mnt/beast    # hand ownership to the ubuntu user
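
To confirm the mount and ownership:

df -h /mnt/beast
ls -ld /mnt/beast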

Then open a screen and launch your REPL:

screen -m
lein repl

From the REPL, use the harvest namespace, sync the resource tables (if necessary), then run harvest-all:

(use 'gulo.harvest)
(in-ns 'gulo.harvest)

(sync-resource-table)
(harvest-all "/mnt/beast")
;; "/mnt/beast" is the local output location

To process only a few resources, it's easiest just to store them in a text file, one to a line. Then call (harvest-all "/mnt/beast" :path-file "/home/ubuntu/resource_list.txt"). If you have the resource list in your REPL, use :path-coll and pass in the collection to harvest-all.
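
For example, the resource list file might look like this (placeholder entries; use whatever identifier harvest-all expects, one per line):

cat > /home/ubuntu/resource_list.txt <<EOF
<resource 1>
<resource 2>
EOF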

The output will be uploaded to Google Cloud Storage via a Python script. Look here for the current staging location:

https://cloud.google.com/console#/project/522126137979/storage/vn-staging/data/

You can also use the :sync flag with harvest-all, but it's not a bad idea to make sure syncing worked correctly before launching a long harvesting process.

For a small number of resources, harvesting will take a few minutes. If you're harvesting everything, just let it run in its screen until it finishes.
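
If you need to disconnect while a long harvest runs, detach from the screen session with Ctrl-a d; when you log back in, reattach with:

screen -r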

Statistics views

For the carousel on the VertNet portal, we need to generate a few stats about the fully harvested data set. The gulo.views namespace handles this for us, and the gulo.main.RunStats defmain kicks off all jobs.

To run the stats queries, we need a Hadoop cluster. The command below can be used to launch it.

instancecount=5; elastic-mapreduce --create --alive --name dev --availability-zone us-east-1d --ami-version 2.0.5 \
--instance-group master --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 \
--instance-group core --instance-type m2.4xlarge --instance-count $instancecount --bid-price 0.75 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/add-swap --args 2048 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.tasktracker.map.tasks.maximum=30,-s,mapred.tasktracker.reduce.tasks.maximum=24,-s,mapred.reduce.tasks=$((24*$instancecount))" \
--bootstrap-action s3://vnproject/bootstrap-actions/gulo/bootstrap.sh

Five slave instances gives us a good balance of cost ($0.75 bid price, usually $0.16/hr in practice) and speed, and the us-east-1d availability zone has stable spot prices. If us-east-1d starts acting up and your cluster won't launch because prices are too high, try us-east-1e as a backup or raise the bid price for both master and slave instances. Finally, note that the bootstrap.sh script at the end of the command sets up a few convenient commands we'll use below.
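
To eyeball recent spot prices before picking a zone or bid, the ec2 command line tools from the harvesting setup can show the history (treat the exact flags as a sketch for your tools version):

ec2-describe-spot-price-history --instance-type m2.4xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY | grep us-east-1d | head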

Once your cluster is running, you need to configure it for the stats queries. Follow these steps in order:

  1. Clone the gulo repo from GitHub. The gulo command should be available through .bashrc. Run it to clone the repo.
  2. Install leiningen: run li (also from .bashrc).
  3. Set up credentials: On the instance, run this bash script. Have your CartoDB and AWS credentials handy for copy/paste.
  4. Get dependencies, compile, and uberjar the repo using the uj command (also from .bashrc); see the consolidated sketch after this list.
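
Put together, the configuration looks roughly like this (a sketch: it assumes the credentials script is on hand and that uj is run from inside the cloned repo):

gulo            # clone the Gulo repo (shortcut from .bashrc)
li              # install leiningen
# run the credentials bash script here, with CartoDB and AWS credentials handy
cd gulo && uj   # fetch dependencies, compile, and build the uberjar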

Now you're ready to run the stats queries!

Launch a REPL for use with Hadoop: hadoop jar target/gulo-0.1.0-SNAPSHOT-standalone.jar clojure.main. Then start the stats queries:

(use 'gulo.main)
(in-ns 'gulo.main)

(RunStats "s3n://vnproject/data/staging/*" "s3n://vnproject/stats")

The results will be stored at s3n://vnproject/stats in directories corresponding to the names of the stats queries. Some of the results are a single number (e.g. number of records), so most part files for those queries will be empty. Look for the one file larger than 70 bytes. The remaining stats (e.g. record count by country) will be scattered across many textline part files. The easiest thing to do in that case is to download all the files using s3cmd and cat them into one:

cd /tmp/
s3cmd get s3://vnproject/stats/2013-06-04/total-recs-by-country
cat total-recs-by-country/part* | sort > total-recs-by-country.txt
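
For the single-number queries, one way to spot the non-empty part file is to sort the listing by size (the total-recs path here is illustrative):

s3cmd ls s3://vnproject/stats/2013-06-04/total-recs/ | sort -k3 -n | tail -1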

And you're done!

Bulkloading

The easiest thing is just to follow the instructions in bulkload.sh in the webapp project.

CartoDB 4-table schema

The final element needed for the VertNet Portal is a set of tables in CartoDB. These are produced using the Teratorn project's Shred defmain.

Launching cluster

The easiest way to get this running is to use Elastic Mapreduce. First set up a credentials.json in a directory of your choosing. I use the gulo project directory. From that directory, run this command to launch a cluster with one m2.4xlarge slave node:

elastic-mapreduce --create --alive --name dev --availability-zone us-east-1d --ami-version 2.0.5 \
--instance-group master --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 \
--instance-group core --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/add-swap \
--args 2048 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.tasktracker.map.tasks.maximum=30,-s,mapred.tasktracker.reduce.tasks.maximum=24" \
--bootstrap-action s3://vnproject/bootstrap-actions/gulo/bootstrap.sh
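
For reference, the credentials.json used by the elastic-mapreduce client looks roughly like this (a sketch: field values are placeholders, and the log_uri bucket/path is an assumption):

cat > credentials.json <<EOF
{
  "access_id": "<vertnet key>",
  "private_key": "<vertnet secret key>",
  "keypair": "vertnet",
  "key-pair-file": "<path to vertnet.pem>",
  "log_uri": "s3n://vnproject/emr-logs/",
  "region": "us-east-1"
}
EOF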

Configuring cluster

The EMR bootstrap script creates a few handy shortcuts for handling lein, dependencies, etc.

# Install lein
li

# Clone Gulo
gulo

# Clone Teratorn
teratorn

Gulo is a dependency, and the easiest way to get it installed is to clone the project and install it on the cluster. You could get it from Clojars, but the project changes pretty quickly and the Clojars version doesn't get updated very frequently.

# Change into the Gulo project (cloned to the home directory by the gulo shortcut)
cd ~/gulo

Now you need to set up credentials for CartoDB and AWS. Run this bash script and have your CartoDB/AWS credentials handy. It adds credentials files to the Gulo project, and configures s3cmd.
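
A quick way to confirm s3cmd got configured correctly is to list the project bucket:

s3cmd ls s3://vnproject/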

Ok, so now you can install Gulo:

cd ~/gulo
lein install
cd ~/

That's it for config!

Running Teratorn

# assumes the Teratorn uberjar has already been built in the project directory (e.g. with lein uberjar)
hadoop jar target/teratorn-0.1.0-SNAPSHOT-standalone.jar teratorn.vertnet.Shred "s3n://vnproject/data/staging/*"

Boom! Shouldn't take more than 20 minutes for 178k records.

Uploading to CartoDB

So now we've got a bunch of textline part files sitting in /tmp on HDFS.
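
You can confirm the Shred output directories are there before copying them down:

hadoop fs -ls /tmp/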

mkdir /mnt/beast
hadoop fs -copyToLocal /tmp/ /mnt/beast/    # pulls the HDFS output down to /mnt/beast/tmp/

# add headers to the output csv files (tab-delimited, to match header.tsv and the part file output)
printf "tax_uuid\tscientificname\tkingdom\tphylum\tclasss\torder\tfamily\tgenus\n" > /mnt/beast/tax.csv
printf "loc_uuid\tlat\tlon\n" > /mnt/beast/loc.csv
printf "taxloc_uuid\ttax_uuid\tloc_uuid\n" > /mnt/beast/taxloc.csv

# for the occurrence table, make sure you've got the latest version of the occurrence fields
wget https://raw.github.com/VertNet/webapp/develop/tools/search/header.tsv

# prepend the uuid columns; no trailing newline, so the header.tsv fields land on the same header line
printf "taxloc_uuid\ttax_uuid\tloc_uuid\tocc_uuid\t" > /mnt/beast/occ.csv

# finalize occ table header
cat header.tsv >> /mnt/beast/occ.csv

# append the output data to the csv files (the part files were copied to /mnt/beast/tmp above)
table="tax"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
table="loc"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
# the tax-loc output feeds the taxloc.csv header file created above
cat /mnt/beast/tmp/tax-loc/part-* >> /mnt/beast/taxloc.csv
table="occ"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv

# zip things up
cd /mnt/beast
mkdir tables
table="tax"; zip tables/$table.zip $table.csv
table="loc"; zip tables/$table.zip $table.csv
table="tax-loc"; zip tables/$table.zip $table.csv
table="occ"; zip tables/$table.zip $table.csv

# put everything on S3
s3cmd -P put /mnt/beast/tables/*.zip s3://vnproject/tables/

S3cmd will print out the public URLs of the zip files. Use those to import the files into CartoDB from the CartoDB dashboard. This will change once we start using the JDBC tap, but for now it works fine.