WMArchive tools
Here we outline useful tools used by WMArchive. All tools are written in Python, and an associated bash wrapper script is provided for each of them (mostly for convenience, e.g. to set up a proper PYTHONPATH):
The dump2hdfs tool dumps a file into HDFS as is, i.e. it will neither modify the file content nor convert it into the Avro format. This is useful to store auxiliary files on HDFS, e.g. Avro schema files.
dump2hdfs --help
usage: dump2hdfs [-h] [--fin FIN]
optional arguments:
-h, --help show this help message and exit
--fin FIN Input avro schema file
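For example, to store an Avro schema file on HDFS (the file name is illustrative):
dump2hdfs --fin=fwjr.avsc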
The json2avsc tool converts a given JSON data file into an Avro schema file.
json2avsc --help
usage: json2avsc [-h] [--fin FIN] [--fout FOUT]
optional arguments:
-h, --help show this help message and exit
--fin FIN Input JSON file
--fout FOUT Output Avro schema file
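For instance, a JSON record such as {"name": "test", "value": 1} (contents are illustrative) would yield an Avro schema file describing the record structure, roughly of the form:
{"type": "record", "name": "...", "fields": [{"name": "name", "type": "string"}, {"name": "value", "type": "long"}]}
The exact record name and numeric type mapping depend on the tool's implementation.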
The json2avro tool converts a given JSON file into the Avro data format following a provided Avro schema.
json2avro --help
usage: json2avro [-h] [--fin FIN] [--schema SCHEMA] [--fout FOUT]
optional arguments:
-h, --help show this help message and exit
--fin FIN Input JSON file
--schema SCHEMA Input Avro schema
--fout FOUT Output Avro file
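A typical workflow (file names are illustrative) is to derive the schema first, then convert the data:
json2avsc --fin=data.json --fout=data.avsc
json2avro --fin=data.json --schema=data.avsc --fout=data.avro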
The mongo2hdfs tool migrates data from MongoDB into HDFS based on the record storage type, i.e. it reads records with the mongodb storage type from MongoDB and dumps them into HDFS. After a successful dump it updates the storage type of those records to hdfsio.
mongo2hdfs --help
usage: mongo2hdfs [-h] [--mongo MURI] [--hdfs HURI]
optional arguments:
-h, --help show this help message and exit
--mongo MURI MongoDB URI
--hdfs HURI HDFS URI
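An example invocation (both URIs are illustrative and depend on your deployment):
mongo2hdfs --mongo=mongodb://localhost:27017 --hdfs=hdfs:///path/wmarchive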
The mongo2avro tool migrates data from MongoDB to Avro files on the local file system. These files are written in append mode to accumulate data up to a provided file size threshold.
mongo2avro --help
usage: mongo2avro [-h] [--mongo MURI] [--schema SCHEMA] [--odir ODIR]
[--thr THR] [--chunk CHUNK] [--verbose]
optional arguments:
-h, --help show this help message and exit
--mongo MURI MongoDB URI
--schema SCHEMA Avro schema file
--odir ODIR Avro output area
--thr THR Avro file size threshold in MB, default 256MB
--chunk CHUNK Chunk size for reading Mongo docs, default 1000
--verbose Verbose output
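An example invocation with a 512 MB threshold (all values are illustrative):
mongo2avro --mongo=mongodb://localhost:27017 --schema=fwjr.avsc --odir=/data/avro --thr=512 --verbose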
The mongocleanup tool performs clean-up of data records in MongoDB. It is based on two factors: first, only records with the hdfsio storage type are read and, second, only those records which exceed their lifetime are wiped out from MongoDB.
mongocleanup --help
usage: mongocleanup [-h] [--mongo MURI] [--tstamp TSTAMP]
optional arguments:
-h, --help show this help message and exit
--mongo MURI MongoDB URI
--tstamp TSTAMP Lifetime timestamp
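An example invocation (the expected --tstamp format is an assumption here; consult the tool's source for the exact format):
mongocleanup --mongo=mongodb://localhost:27017 --tstamp=20160101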
The mrjob tool generates (and optionally executes) an MR bash script for the end-user.
mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
[--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]
Tool to generate and/or execute MapReduce (MR) script. The code is generated
from the MR skeleton provided by WMArchive and a user-based MR file. The latter must
contain two functions, mapper(ctx) and reducer(ctx), for a given context. Their
simplest implementation can be found in WMArchive/MapReduce/mruser.py.
Based on this code please create your own mapper/reducer functions and use this
tool to generate the final MR script.
optional arguments:
-h, --help show this help message and exit
--hdir HDIR HDFS input data directory
--odir ODIR HDFS output directory for MR jobs
--schema SCHEMA Data schema file on HDFS
--mrpy MRPY MapReduce python script
--pydoop PYDOOP pydoop archive file, e.g. /path/pydoop.tgz
--avro AVRO avro archive file, e.g. /path/avro.tgz
--execute Execute generated MR job script
--verbose Verbose output
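As a sketch, a user MR file following the mapper(ctx)/reducer(ctx) contract described above might look as follows; how the record is extracted from the context and the 'task' field are assumptions, so consult WMArchive/MapReduce/mruser.py for the authoritative skeleton.

def mapper(ctx):
    "Select records of interest from the given context"
    rec = ctx  # assumption: ctx is (or wraps) a single FWJR record as a dict
    if rec and rec.get('task', '').startswith('/Test'):  # 'task' is an illustrative field
        return rec

def reducer(ctx):
    "Aggregate the records selected by the mapper"
    return ctx  # identity reducer; real code would summarize the selected records

Such a file can then be passed to mrjob (paths are illustrative):
mrjob --hdir=hdfs:///path/data --odir=hdfs:///path/out --schema=hdfs:///path/fwjr.avsc --mrpy=mruser.py --execute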
The myspark tool executes Python code on the Spark platform. The end-user is responsible for writing his/her own executor (mapper) function.
myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT] [--verbose]
optional arguments:
-h, --help show this help message and exit
--hdir HDIR Input data location on HDFS, e.g. hdfs:///path/data
--schema SCHEMA Input schema, e.g. hdfs:///path/fwjr.avsc
--script SCRIPT python script with custom mapper/reducer functions
--verbose verbose output
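As a hedged sketch, a custom script for myspark might expose a mapper function over the records read from HDFS; the function signature and record fields below are assumptions based on the description above, not the tool's documented API.

def mapper(records):
    "Custom executor: filter the given list of FWJR records"
    # 'task' is an illustrative field name; adapt to the actual record schema
    return [r for r in records if r and r.get('task')]

An example invocation (paths are illustrative):
myspark --hdir=hdfs:///path/data --schema=hdfs:///path/fwjr.avsc --script=myscript.py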
Although WMArchive does not rely on Java, here we provide basic instructions on how to use Java to read/write Avro files (this is handy when we want to verify that our tools work with Java); see also http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/ for more details. First we need to download the avro-tools jar:
curl -O http://apache.arvixe.com/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
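For this walkthrough, assume a minimal simple.json with one JSON record per line, e.g. (contents are illustrative):
{"name": "test", "value": 1}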
Then, we create an Avro schema using the json2avsc tool:
bin/json2avsc --fin=simple.json --fout=simple.avsc
and now we can create an Avro file using the Java avro-tools and verify that it can be translated back to JSON correctly:
java -jar avro-tools-1.7.7.jar fromjson --schema-file simple.avsc simple.json > simple.avro
java -jar avro-tools-1.7.7.jar tojson simple.avro > s.json
diff simple.json s.json