
Setup and usage

Valentin Kuznetsov edited this page Mar 20, 2018 · 6 revisions

Accessing HDFS

To run CMSSpark you'll need access to data on HDFS (Hadoop File System). To get it, either configure your VM properly (CMS-VOC has the proper puppet configuration to access HDFS) or set up an account on the CERN analytix cluster. Once you have an account, log in to the node where you'll run the code and proceed.

Cloning from GitHub

Project files are stored in a GitHub repository and can be cloned with:

git clone https://github.com/vkuznet/CMSSpark.git

Configuration

In order to run CMSSpark you'll need to configure it first. You may start from the default configuration file found in etc/conf.json.

The configuration file should provide the following attributes:

  • aaa_dir, cmssw_dir, eos_dir and jm_dir define the locations of the aaa, cmssw, eos and jm directories to read data from, respectively
  • output_dir defines the output directory on HDFS where aggregation results are stored
  • stomp_path provides the path used to locate the stomp.py module
  • credentials should point to a JSON file which contains credential information
  • keytab provides the location of the Kerberos keytab file to be used

Here is an example of configuration file:

{
    "output_dir" : "hdfs:///cms/users/vk/agg",
    "stomp_path" : "/data/users/vk/CMSSpark/static/stomp.py-4.1.15-py2.7.egg",
    "credentials" : "/data/users/vk/CMSSpark/acronjob/amq_broker.json",
    "keytab" : "/data/users/vk/CMSSpark/acronjob/agg.keytab",
    "aaa_dir" : "/project/monitoring/archive/xrootd/enr/gled",
    "cmssw_dir" : "/project/awg/cms/cmssw-popularity/avro-snappy",
    "eos_dir" : "/project/monitoring/archive/eos/logs/reports/cms",
    "jm_dir" : "/project/awg/cms/jm-data-popularity/avro-snappy"
}
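Before running anything, it can be useful to sanity-check the configuration. The helper below is a minimal sketch (not part of CMSSpark itself) that loads a conf.json and verifies the attributes listed above are present:

```python
import json

# Required attributes, taken from the example configuration above
REQUIRED_KEYS = {
    "output_dir", "stomp_path", "credentials",
    "keytab", "aaa_dir", "cmssw_dir", "eos_dir", "jm_dir",
}

def load_config(path):
    """Load a CMSSpark JSON configuration and check required attributes."""
    with open(path) as istream:
        config = json.load(istream)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError("missing configuration keys: %s" % sorted(missing))
    return config
```

A check like this fails fast with a clear message instead of a confusing error deep inside a Spark job.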

Keytab file

The keytab file is used to renew your Kerberos session while running as a cron job. It can be generated with the following commands (replace <username> with your username, and feel free to use any name you wish instead of agg.keytab, which is used here):

cd /tmp
ktutil
ktutil:  addent -password -p <username>@CERN.CH -k 1 -e rc4-hmac
ktutil:  addent -password -p <username>@CERN.CH -k 1 -e aes256-cts
ktutil:  wkt agg.keytab
ktutil:  quit

mv agg.keytab <YOUR_DIRECTORY>

Once you have generated your keytab file, remember to restrict access to it, since it contains your Kerberos credentials. We suggest changing its permissions as follows:

chmod 0600 <YOUR_DIRECTORY>/agg.keytab
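To confirm the permissions took effect, you can print them in octal form; the snippet below uses a stand-in file for illustration (on the Linux analytix nodes, GNU stat is available):

```shell
# Stand-in for your real keytab path; substitute your own
touch agg.keytab
chmod 0600 agg.keytab

# Print octal permissions; expect 600 (owner read/write only)
stat -c '%a' agg.keytab
```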

Finally, adjust your CMSSpark config to point to your keytab file.

How to run CMSSpark

Once you have the code in place, you'll need to set up the proper environment, e.g.

export PYTHONPATH=<YOUR_PATH>/CMSSpark/src/python:$PYTHONPATH
export PATH=<YOUR_PATH>/CMSSpark/bin:$PATH

Then you can run CMSSpark scripts as follows:

run_spark <script_name> <parameters>

For example, to run the dbs_condor.py workflow, invoke it as:

run_spark dbs_condor.py --fout=hdfs:///cms/users/vk/dbs_condor --yarn

Here are more examples of how to run different workflows:

# DBS+PhEDEx
apatterns="*BUNNIES*,*Commissioning*,*RelVal*"
hdir=hdfs:///cms/users/vk/datasets
run_spark dbs_phedex.py --fout=$hdir --antipatterns=$apatterns --yarn --verbose

# DBS+CMSSW
run_spark dbs_cmssw.py --verbose --yarn --fout=hdfs:///cms/users/vk/cmssw --date=20170411

# DBS+AAA
run_spark dbs_aaa.py --verbose --yarn --fout=hdfs:///cms/users/vk/aaa --date=20170411

# DBS+EOS
run_spark dbs_eos.py --verbose --yarn --fout=hdfs:///cms/users/vk/eos --date=20170411

# WMArchive examples:
run_spark wmarchive.py --fout=hdfs:///cms/users/vk/wma --date=20170411
run_spark wmarchive.py --fout=hdfs:///cms/users/vk/wma --date=20170411,20170420 --yarn
run_spark wmarchive.py --fout=hdfs:///cms/users/vk/wma --date=20170411-20170420 --yarn
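The three --date forms in the WMArchive examples (a single date, a comma-separated list, and a dash-separated inclusive range) can all be expanded into a list of YYYYMMDD strings. The helper below is an illustrative sketch of that expansion, not the actual CMSSpark parser:

```python
from datetime import datetime, timedelta

def expand_dates(spec):
    """Expand a --date spec into a list of YYYYMMDD strings.

    Mirrors the three forms used above:
      '20170411'           -> single date
      '20170411,20170420'  -> explicit list
      '20170411-20170420'  -> inclusive range
    """
    if "-" in spec:
        start, end = (datetime.strptime(d, "%Y%m%d") for d in spec.split("-"))
        ndays = (end - start).days
        return [(start + timedelta(days=i)).strftime("%Y%m%d")
                for i in range(ndays + 1)]
    return spec.split(",")
```

For instance, expand_dates("20170411-20170420") yields ten consecutive dates, matching the range form in the last example.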