x VertNet data architecture HOWTO
Much on this architecture page is obsolete. Use the Harvest workflow document instead to understand how harvesting is accomplished.
- ec2 command line tools: download and install from http://aws.amazon.com/developertools/351
- echo "export AWS_ACCESS_KEY=<vertnet key>" >> ~/.bashrc - make sure to include the vertnet key
- echo "export AWS_SECRET_KEY=<vertnet secret key>" >> ~/.bashrc - make sure to include the secret key
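Before launching anything, it can be worth confirming that both variables actually made it into the current shell (for example after opening a new terminal or sourcing ~/.bashrc). The following is a minimal sketch; the function name check_env_vars is illustrative and not part of the ec2 tools.

```shell
# Report which of the named environment variables are unset or empty.
# Returns 0 when all are set, 1 otherwise.
# NOTE: check_env_vars is a hypothetical helper, not an ec2 tool.
check_env_vars() {
    rc=0
    for name in "$@"; do
        # Indirect lookup via eval keeps this POSIX-sh compatible.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   check_env_vars AWS_ACCESS_KEY AWS_SECRET_KEY && echo "credentials present"
```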
Assuming you've got the ec2 tools installed, you can launch the harvest/bulkload instance in one go:
ec2-run-instances ami-5cebac35 -n 1 -k vertnet --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY
Here's that translated into words:
ec2-run-instances
- command line tool for launching an instance
ami-5cebac35
- the harvest/bulkload AMI, all configured and ready to go
-n 1
- number of instances
-k vertnet
- keypair name - corresponds to the file vertnet.pem
that you'll need later to log in
--instance-type m3.xlarge
- instance type chosen for optimized bulkloading
-O $AWS_ACCESS_KEY
- VertNet access key
-W $AWS_SECRET_KEY
- VertNet secret key
You can also launch it from the EC2 admin console. For any options not specified here, just use the defaults.
To get the DNS address of your instance, run ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY.
Log in using something like this: ssh -i ~/.ssh/vertnet.pem ubuntu@<instance DNS address>.
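If several instances are listed, picking the DNS name out by eye gets tedious. The sketch below filters the text on stdin. It assumes the classic ec2-describe-instances layout, in which INSTANCE rows are tab-separated with the public DNS name in column 4 and the state in column 6. Check that assumption against your own output before relying on it. The function name is illustrative.

```shell
# Print the public DNS name of every running instance found on stdin.
# ASSUMPTION: input is ec2-describe-instances text output, where INSTANCE
# rows are tab-separated, column 4 is the public DNS name and column 6 is
# the instance state. running_instance_dns is a hypothetical helper.
running_instance_dns() {
    awk -F '\t' '$1 == "INSTANCE" && $6 == "running" { print $4 }'
}

# Usage:
#   ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY | running_instance_dns
```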
If you want to rebuild the VertNet AMI from scratch, launch a generic Ubuntu 12.04 machine:
ec2-run-instances <generic Ubuntu 12.04 AMI id> -n 1 -k vertnet --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY
The only difference is the AMI id. Once it's running, use the ec2-bootstrap script to configure the machine, including installing Java and a few handy utility programs, cloning project files, etc. You must be present for the end of the script so that you can supply the required AWS and CartoDB credentials.
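Once you have your instance running, be sure to update Gulo to the latest version: git pull origin develop.
Then, double-check that the EBS volume is mounted to /mnt/beast and is owned by ubuntu. If the volume has not been set up yet, the following commands format it, mount it, and set ownership (mkfs erases any existing data on the device):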
sudo mkdir /mnt/beast
sudo mkfs -t ext3 /dev/xvdb
sudo mount /dev/xvdb /mnt/beast
sudo chown ubuntu:ubuntu /mnt/beast
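Because mkfs creates a fresh filesystem on the device, it can help to check first whether the volume is already mounted and owned by ubuntu. This is a minimal sketch, assuming a Linux host with /proc/mounts and GNU stat (both present on Ubuntu). The function names is_mounted and owned_by are illustrative, not part of the VertNet tooling.

```shell
# Succeed if DIR appears as a mount point in the kernel mount table.
# Pass the path without a trailing slash, e.g. /mnt/beast.
is_mounted() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' /proc/mounts
}

# Succeed if DIR is owned by USER (relies on GNU stat's -c option).
owned_by() {
    [ "$(stat -c %U "$1")" = "$2" ]
}

# Usage:
#   is_mounted /mnt/beast && owned_by /mnt/beast ubuntu && echo "volume ready"
```

If both checks pass, the mkfs, mount, and chown steps can be skipped.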
Then open a screen and launch your REPL:
screen -m
lein repl
From the REPL, use the harvest namespace, sync the resource tables (if necessary), then run harvest-all:
(use 'gulo.harvest)
(in-ns 'gulo.harvest)
(sync-resource-table)
(harvest-all "/mnt/beast")
;; "/mnt/beast" is the local output location
To process only a few resources, it's easiest just to store them in a text file, one to a line. Then call (harvest-all "/mnt/beast" :path-file "/home/ubuntu/resource_list.txt"). If you have the resource list in your REPL, use :path-coll and pass in the collection to harvest-all.
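One way to produce such a file from the shell is sketched below. The helper name write_resource_list and the resource identifiers in the usage comment are placeholders. The exact form of each entry depends on what harvest-all expects and is not specified here.

```shell
# Write the given resource identifiers to FILE, one per line, skipping
# empty arguments and dropping duplicates while keeping first-seen order.
# NOTE: write_resource_list is a hypothetical helper.
write_resource_list() {
    file=$1
    shift
    for resource in "$@"; do
        [ -n "$resource" ] && printf '%s\n' "$resource"
    done | awk '!seen[$0]++' > "$file"
}

# Usage (placeholder resource names):
#   write_resource_list /home/ubuntu/resource_list.txt resource-one resource-two
```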
The output will be uploaded to Google Cloud Storage via a python script. Look here for the current staging location:
https://cloud.google.com/console#/project/522126137979/storage/vn-staging/data/
You can also use the :sync flag with harvest-all, but it's not a bad idea to make sure syncing worked correctly before launching a long harvesting process.
For a small number of resources, harvesting will take a few minutes. If you're harvesting everything, just let it run in its screen until it finishes.
For the carousel on the VertNet portal, we need to generate a few stats about the fully harvested data set. The gulo.views namespace handles this for us, and the gulo.main.RunStats defmain kicks off all jobs.
To run the stats queries, we need a Hadoop cluster. The command below can be used to launch it.
instancecount=5; elastic-mapreduce --create --alive --name dev --availability-zone us-east-1d --ami-version 2.0.5 --instance-group master --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 --instance-group core --instance-type m2.4xlarge --instance-count $instancecount --bid-price 0.75 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/add-swap --args 2048 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,mapred.tasktracker.map.tasks.maximum=30,-s,mapred.tasktracker.reduce.tasks.maximum=24,-s,mapred.reduce.tasks=$((24*$instancecount))" --bootstrap-action s3://vnproject/bootstrap-actions/gulo/bootstrap.sh
Five slave instances give us a good balance of cost ($0.75 bid price, usually $0.16/hr in practice) and speed, and the us-east-1d availability zone has stable spot prices. If us-east-1d starts acting up and your cluster won't launch because prices are too high, try us-east-1e as a backup or raise the bid price for both master and slave instances. Finally, note that the bootstrap.sh script at the end of the command sets up a few convenient commands we'll use below.
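The launch command ties mapred.reduce.tasks to the cluster size: 24 reduce slots per node times the number of core instances, so 120 for five instances. The sketch below rebuilds that configure-hadoop argument string for a given instance count, which may be handy when resizing the cluster. The function name hadoop_config_args is illustrative.

```shell
# Build the --args string for the configure-hadoop bootstrap action,
# given the number of core (slave) instances.
# NOTE: hadoop_config_args is a hypothetical helper; the per-node slot
# counts are the ones used in the launch command above.
hadoop_config_args() {
    core_instances=$1
    map_per_node=30      # mapred.tasktracker.map.tasks.maximum
    reduce_per_node=24   # mapred.tasktracker.reduce.tasks.maximum
    printf '%s\n' "-s,mapred.tasktracker.map.tasks.maximum=${map_per_node},-s,mapred.tasktracker.reduce.tasks.maximum=${reduce_per_node},-s,mapred.reduce.tasks=$(( reduce_per_node * core_instances ))"
}

# Usage:
#   hadoop_config_args 5
```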
Once your cluster is running, you need to configure it for the stats queries. Follow these steps in order:
- Clone the gulo repo from GitHub. The gulo command should be available through .bashrc. Run it to clone the repo.
- Install leiningen: run li (also from .bashrc).
- Set up credentials: On the instance, run this bash script. Have your CartoDB and AWS credentials handy for copy/paste.
- Get dependencies, compile, and uberjar the repo using the uj command (also from .bashrc).
Now you're ready to run the stats queries!
Launch a REPL for use with Hadoop: hadoop jar target/gulo-0.1.0-SNAPSHOT-standalone.jar clojure.main. Then start the stats queries:
(use 'gulo.main)
(in-ns 'gulo.main)
(RunStats "s3n://vnproject/data/staging/*" "s3n://vnproject/stats")
The results will be stored at s3n://vnproject/stats in directories corresponding to the name of the stats queries. Some of the results are a single number (e.g. number of records), so most part files for those queries will be empty. Look for the one file larger than 70 bytes. The remaining stats (e.g. record count by country) will be scattered across many textline part files. The easiest thing to do in that case is to download all the files using s3cmd and cat the files into one:
cd /tmp/
s3cmd get --recursive s3://vnproject/stats/2013-06-04/total-recs-by-country
cat total-recs-by-country/part* | sort > total-recs-by-country.txt
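Both cases described above (one non-empty part file for single-number stats, many part files for per-group stats) can be handled with small helpers once the directory has been downloaded. This is a sketch only. The function names are illustrative, and the 70-byte threshold is simply the rule of thumb given in the text.

```shell
# Print part files under DIR that are larger than 70 bytes. For a
# single-number stat this should be the one file holding the result.
# NOTE: find_result_parts and combine_parts are hypothetical helpers.
find_result_parts() {
    find "$1" -type f -name 'part*' -size +70c
}

# Concatenate all part files under DIR and write them, sorted, to OUT.
combine_parts() {
    cat "$1"/part* | sort > "$2"
}

# Usage:
#   find_result_parts total-recs
#   combine_parts total-recs-by-country total-recs-by-country.txt
```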
And you're done!
For bulkloading, the easiest thing is just to follow the instructions in bulkload.sh in the webapp project.
The final element needed for the VertNet Portal is a set of tables in CartoDB. These are produced using the Teratorn project's Shred defmain.
The easiest way to get this running is to use Elastic MapReduce. First set up a credentials.json in a directory of your choosing. I use the gulo project directory. From that directory, run this command to launch a cluster with one m2.4xlarge slave node:
elastic-mapreduce --create --alive --name dev --availability-zone us-east-1d --ami-version 2.0.5 \
--instance-group master --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 \
--instance-group core --instance-type m2.4xlarge --instance-count 1 --bid-price 0.75 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/add-swap \
--args 2048 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.tasktracker.map.tasks.maximum=30,-s,mapred.tasktracker.reduce.tasks.maximum=24" \
--bootstrap-action s3://vnproject/bootstrap-actions/gulo/bootstrap.sh
The EMR bootstrap script creates a few handy shortcuts for handling lein, dependencies, etc.
# Install lein
li
# Clone Gulo
gulo
# Clone Teratorn
teratorn
Gulo is a dependency, and the easiest way to get it installed is to clone the project and install it on the cluster. You could get it from Clojars, but the project changes pretty quickly and the Clojars version doesn't get updated very frequently.
# Move into the Gulo project directory
cd ~/
cd gulo
Now you need to set up credentials for CartoDB and AWS. Run this bash script and have your CartoDB/AWS credentials handy. It adds credentials files to the Gulo project, and configures s3cmd.
Ok, so now you can install Gulo:
cd ~/gulo
lein install
cd ~/
That's it for config!
hadoop jar target/teratorn-0.1.0-SNAPSHOT-standalone.jar teratorn.vertnet.Shred "s3n://vnproject/data/staging/*"
Boom! Shouldn't take more than 20 minutes for 178k records.
So now we've got a bunch of textline part files sitting in /tmp on HDFS. Copy them to local disk and assemble the tables:
mkdir /mnt/beast
hadoop fs -copyToLocal /tmp/ /mnt/beast/
# add headers to output csv files
echo "tax_uuid scientificname kingdom phylum classs order family genus" > /mnt/beast/tax.csv
echo "loc_uuid lat lon" > /mnt/beast/loc.csv
echo "taxloc_uuid tax_uuid loc_uuid" > /mnt/beast/tax-loc.csv
# for the occurrence table, make sure you've got the latest version of the occurrence fields
wget https://raw.github.com/VertNet/webapp/develop/tools/search/header.tsv
# pre-pend uuids
echo "taxloc_uuid tax_uuid loc_uuid occ_uuid" > /mnt/beast/occ.csv
# finalize occ table header
cat header.tsv >> /mnt/beast/occ.csv
# add output data to csv files
table="tax"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
table="loc"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
table="tax-loc"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
table="occ"; cat /mnt/beast/tmp/$table/part-* >> /mnt/beast/$table.csv
# zip things up
cd /mnt/beast
mkdir tables
table="tax"; zip tables/$table.zip $table.csv
table="loc"; zip tables/$table.zip $table.csv
table="tax-loc"; zip tables/$table.zip $table.csv
table="occ"; zip tables/$table.zip $table.csv
# put everything on S3
s3cmd -P put /mnt/beast/tables/*.zip s3://vnproject/tables/
S3cmd will print out all the public URLs of the different zip files. Use those to import the files into CartoDB at the CartoDB dashboard. This will change once we start using the JDBC tap, but for now it works fine.
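The header-then-append pattern used for each table can be wrapped in one function, so the same table name drives both the source directory and the output file. This is a sketch under the assumption that the part files for each table sit in a directory named after the table. The function name build_table is illustrative and is not part of Teratorn or Gulo.

```shell
# Write HEADER followed by every part file under SRC_DIR/TABLE to
# OUT_DIR/TABLE.csv.
# NOTE: build_table is a hypothetical helper.
build_table() {
    table=$1
    header=$2
    src_dir=$3
    out_dir=$4
    {
        printf '%s\n' "$header"
        cat "$src_dir/$table"/part-*
    } > "$out_dir/$table.csv"
}

# Usage:
#   build_table loc "loc_uuid lat lon" /mnt/beast/tmp /mnt/beast
```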