WMArchive tools

Tools

Here we outline the tools used by WMArchive. All tools are written in Python, and an associated bash wrapper script is provided for each of them (mostly for convenience, e.g. to set up the proper PYTHONPATH):

The dump2hdfs tool dumps a file into HDFS as is, i.e. it will neither modify the file content nor convert it to the Avro format. This is useful for storing arbitrary files in HDFS, e.g. Avro schema files.

dump2hdfs --help
usage: dump2hdfs [-h] [--fin FIN]

optional arguments:
  -h, --help  show this help message and exit
  --fin FIN   Input avro schema file
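
For example, to store an Avro schema file on HDFS (the file name below is illustrative):

dump2hdfs --fin=fwjr.avsc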

The json2avsc tool converts a given JSON data file into an Avro schema file.

json2avsc --help
usage: json2avsc [-h] [--fin FIN] [--fout FOUT]

optional arguments:
  -h, --help   show this help message and exit
  --fin FIN    Input JSON file
  --fout FOUT  Output Avro schema file
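
For example (input and output file names here are illustrative):

json2avsc --fin=record.json --fout=record.avsc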

The json2avro tool converts a given JSON file to the Avro data format following a provided Avro schema.

json2avro --help
usage: mongo2hdfs [-h] [--fin FIN] [--schema SCHEMA] [--fout FOUT]

optional arguments:
  -h, --help       show this help message and exit
  --fin FIN        Input JSON file
  --schema SCHEMA  Input Avro schema
  --fout FOUT      Output Avro file
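
For example, to convert a JSON file into an Avro file using a schema produced by json2avsc (file names are illustrative):

json2avro --fin=record.json --schema=record.avsc --fout=record.avro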

The mongo2hdfs tool migrates data from MongoDB into HDFS based on the record storage type, i.e. it reads records with the mongodb storage type from MongoDB and dumps them into HDFS. After a successful dump it updates the records' storage type to hdfsio.

mongo2hdfs --help
usage: mongo2hdfs [-h] [--mongo MURI] [--hdfs HURI]

optional arguments:
  -h, --help    show this help message and exit
  --mongo MURI  MongoDB URI
  --hdfs HURI   HDFS URI
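
For example (the MongoDB URI and HDFS location below are placeholders, adjust them to your deployment):

mongo2hdfs --mongo=mongodb://localhost:27017 --hdfs=hdfs:///path/data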

The mongo2avro tool migrates data from MongoDB to Avro files on the local file system. These files are written in append mode to accumulate data up to the provided size threshold.

mongo2avro --help
usage: mongo2hdfs [-h] [--mongo MURI] [--schema SCHEMA] [--odir ODIR]
                  [--thr THR] [--chunk CHUNK] [--verbose]

optional arguments:
  -h, --help       show this help message and exit
  --mongo MURI     MongoDB URI
  --schema SCHEMA  Avro schema file
  --odir ODIR      Avro output area
  --thr THR        Avro file size threshold in MB, default 256MB
  --chunk CHUNK    Chunk size for reading Mongo docs, default 1000
  --verbose        Verbose output
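
For example (the MongoDB URI, schema file, and output directory are placeholders):

mongo2avro --mongo=mongodb://localhost:27017 --schema=fwjr.avsc --odir=/data/avro --thr=256 --chunk=1000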

The mongocleanup tool performs clean-up of data records in MongoDB. It is based on two criteria: first, only records with the hdfsio storage type are read and, second, only those which exceed their lifetime are wiped out from MongoDB.

mongocleanup --help
usage: mongocleanup [-h] [--mongo MURI] [--tstamp TSTAMP]

optional arguments:
  -h, --help       show this help message and exit
  --mongo MURI     MongoDB URI
  --tstamp TSTAMP  Lifetime timestamp
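
For example (the MongoDB URI is a placeholder and the expected --tstamp value format is not shown here; consult the tool for the lifetime timestamp it accepts):

mongocleanup --mongo=mongodb://localhost:27017 --tstamp=<lifetime timestamp>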

The mrjob tool generates (and optionally executes) a MapReduce (MR) bash script for the end-user.

mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
             [--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]

Tool to generate and/or execute MapReduce (MR) script. The code is generated
from the MR skeleton provided by WMArchive and a user-provided MR file. The latter
must contain two functions: mapper(ctx) and reducer(ctx) for a given context.
Their simplest implementation can be found in WMArchive/MapReduce/mruser.py.
Based on this code please create your own mapper/reducer functions and use this
tool to generate the final MR script.

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      HDFS input data directory
  --odir ODIR      HDFS output directory for MR jobs
  --schema SCHEMA  Data schema file on HDFS
  --mrpy MRPY      MapReduce python script
  --pydoop PYDOOP  pydoop archive file, e.g. /path/pydoop.tgz
  --avro AVRO      avro archive file, e.g. /path/avro.tgz
  --execute        Execute generated MR job script
  --verbose        Verbose output
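
A user MR file is a small Python module. Below is a minimal sketch modeled on the mapper(ctx)/reducer(ctx) contract described in the help text above; what exactly the ctx object carries is an assumption here, so consult WMArchive/MapReduce/mruser.py for the authoritative skeleton.

def mapper(ctx):
    # User-defined mapper function, invoked for each record in the given context.
    # The contents of ctx are defined by the WMArchive MR skeleton
    # (see WMArchive/MapReduce/mruser.py).
    pass

def reducer(ctx):
    # User-defined reducer function, invoked to aggregate the mapper output
    # for the given context.
    pass

Once the MR file is ready, a typical invocation may look like this (the HDFS paths, MR file name, and archive locations below are placeholders):

mrjob --hdir=hdfs:///path/data --odir=hdfs:///path/out --schema=hdfs:///path/fwjr.avsc \
      --mrpy=mymr.py --pydoop=/path/pydoop.tgz --avro=/path/avro.tgz --execute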

The myspark tool executes Python code on the Spark platform. The end-user is responsible for writing his/her own executor (mapper) function.

myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT] [--verbose]

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      Input data location on HDFS, e.g. hdfs:///path/data
  --schema SCHEMA  Input schema, e.g. hdfs:///path/fwjr.avsc
  --script SCRIPT  python script with custom mapper/reducer functions
  --verbose        verbose output
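
For example (the HDFS locations follow the examples from the help message; myscript.py stands for the user's script with the custom mapper/reducer functions):

myspark --hdir=hdfs:///path/data --schema=hdfs:///path/fwjr.avsc --script=myscript.py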

How to create/read/write Avro files from the command line using Java

Although WMArchive does not rely on Java, here we provide basic instructions on how to use Java to read/write Avro files (this is handy when we want to verify that our tools work with Java). First we need to download the avro-tools jar:

http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
curl -O http://apache.arvixe.com/avro/avro-1.7.7/java/avro-tools-1.7.7.jar

Then we create an Avro schema using the json2avsc tool:

bin/json2avsc --fin=simple.json --fout=simple.avsc

Now we can create an Avro file using the Java avro-tools and verify that it can be translated back to JSON correctly:

java -jar avro-tools-1.7.7.jar fromjson --schema-file simple.avsc simple.json > simple.avro
java -jar avro-tools-1.7.7.jar tojson simple.avro > s.json
diff simple.json s.json
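
As an additional cross-check, the same Avro file can be read back with the Python avro package (a minimal sketch, assuming the avro package is installed and simple.avro sits in the current directory):

from avro.datafile import DataFileReader
from avro.io import DatumReader

# open the Avro file produced above and print every record it contains
with open('simple.avro', 'rb') as favro:
    reader = DataFileReader(favro, DatumReader())
    for record in reader:
        print(record)
    reader.close()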