Data collection from multiple streams script

Valentin Kuznetsov edited this page Sep 26, 2017 · 1 revision

Introduction

The data_collection.py script collects data from multiple log streams (AAA, CMSSW, EOS, CRAB) and writes it out as sets of CSV files that all share the same predefined attribute names.

The script is run for a specific date. It looks up the streams for that date and uses them as input.

A separate directory is created automatically for each stream and each day, so records from different days and streams do not mix. For example: CMSSW/2016/09/15/. Output files are saved in CSV format (JSON output is also possible, but it requires a small change to the code).
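The directory layout described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name stream_output_dir is hypothetical, and the root path is the one from the execution example on this page.

```python
from datetime import datetime
import posixpath


def stream_output_dir(root, stream, date):
    """Return <root>/<stream>/<YYYY>/<MM>/<DD> for a YYYYMMDD date string."""
    # strptime raises ValueError if the string is not a valid date
    day = datetime.strptime(date, "%Y%m%d")
    return posixpath.join(root, stream, day.strftime("%Y/%m/%d"))


print(stream_output_dir("hdfs:///cms/users/username/streams", "CMSSW", "20160915"))
# hdfs:///cms/users/username/streams/CMSSW/2016/09/15
```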

If the output directory already exists, it is deleted together with all files and directories inside it before the new records are written.
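To make the consequence concrete, here is a local-filesystem analogue of that behaviour. It is only a sketch: the real job writes to HDFS, the helper name reset_output_dir is hypothetical, and recreating the empty directory afterwards is an assumption made for the demonstration.

```python
import os
import shutil
import tempfile


def reset_output_dir(path):
    """Delete path and everything under it if it exists, then recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)


# Demonstration in a temporary directory: a file from an earlier run is lost.
root = tempfile.mkdtemp()
out = os.path.join(root, "CMSSW", "2016", "09", "15")
os.makedirs(out)
open(os.path.join(out, "old.csv"), "w").close()

reset_output_dir(out)
print(os.listdir(out))  # []
```

In practice this means that rerunning the script for the same date and output path discards the previous export for that day.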

Output record format

Ideally, every record would contain all of these attributes:

  • file name
  • file size
  • primds
  • procds
  • tier
  • site name
  • file replicas
  • user dn
  • start/end time
  • read bytes
  • cpu/wc values
  • source: xrootd, eos, cmssw, crab

However, none of the four streams provides all of these attributes.

Attribute AAA (Xrootd) CMSSW EOS JobMonitoring (CRAB)
file name + + + +
file size + + + +
primds + + + +
procds + + + +
tier + + + +
site name + +
file replicas
user dn + + +
start/end time + + +* +
read bytes + +
cpu/wc values +

* The same timestamp is used for both the start and the end time.
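One way to picture the common record format is to project each stream's record onto a fixed list of attribute names and leave empty whatever the stream does not provide. The sketch below assumes the attribute names from the CSV header in the output example on this page; the function normalise and the sample record are hypothetical and are not taken from the script.

```python
COMMON_FIELDS = [
    "file_name", "file_size", "site_name", "user_dn", "start_time",
    "end_time", "read_bytes", "source", "primds", "procds", "tier",
]


def normalise(record, source):
    """Project a stream record onto the common field list; missing values become None."""
    row = {name: record.get(name) for name in COMMON_FIELDS}
    row["source"] = source
    return row


# Hypothetical partial record: it carries no site name, user DN or read bytes.
partial = {
    "file_name": "/store/data/example.root",
    "file_size": 1024,
    "start_time": 1492023118,
    "end_time": 1492023118,
}
row = normalise(partial, "eos")
print(row["site_name"], row["source"])  # None eos
```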

Execution example

The script must be run through the run_spark script. The user must provide a date (as YYYYMMDD) and an output directory. The --yarn and --verbose arguments are optional.

    run_spark data_collection.py --yarn --date 20160915 --fout hdfs:///cms/users/username/streams --verbose

Script arguments

  • --yarn - run the job on the analytics cluster via the YARN resource manager.
  • --date YYYYMMDD - the day for which data is collected. It is also used to build the output path. The date must be in YYYYMMDD format (YYYY: four-digit year, MM: two-digit month, DD: two-digit day).
  • --fout <path> - root path of the output directory. The CSV files are saved in <path>/Stream_name/Year/Month/Day (e.g. with --fout /cms/users/username/streams, CMSSW files are written to /cms/users/username/streams/CMSSW/2016/09/15/).
  • --verbose - if this flag is present, the script prints diagnostic information.
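The argument list above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the script's actual parser: the validator yyyymmdd and the function build_parser are hypothetical names, and the strict eight-digit check is a choice made here for illustration.

```python
import argparse
import re
from datetime import datetime


def yyyymmdd(value):
    """argparse type: accept only an eight-digit string that is a valid calendar date."""
    if not re.fullmatch(r"[0-9]{8}", value):
        raise argparse.ArgumentTypeError("date must be in YYYYMMDD format")
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError("not a valid calendar date: " + value)
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="data_collection.py")
    parser.add_argument("--date", type=yyyymmdd, required=True,
                        help="day to process, as YYYYMMDD")
    parser.add_argument("--fout", required=True,
                        help="root path of the output directory")
    parser.add_argument("--yarn", action="store_true",
                        help="run the job via the YARN resource manager")
    parser.add_argument("--verbose", action="store_true",
                        help="print diagnostic information")
    return parser


args = build_parser().parse_args(
    ["--yarn", "--date", "20160915",
     "--fout", "hdfs:///cms/users/username/streams", "--verbose"]
)
print(args.date, args.fout, args.yarn, args.verbose)
# 20160915 hdfs:///cms/users/username/streams True True
```

Validating the date at parse time means a malformed value such as 2016-09-15 is rejected before any job is submitted.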

Output example

First three records from CMSSW/2017/04/12 in CSV format:

file_name,file_size,site_name,user_dn,start_time,end_time,read_bytes,source,primds,procds,tier
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492023118,33899141,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492037450,34059668,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1491995280,34057033,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
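
Records in this format can be read back with Python's csv module. The sketch below parses the header and the first record shown above. It assumes that the literal string null marks a missing value and that the size, time and byte-count columns are integers; the names parse_records and INT_FIELDS are hypothetical.

```python
import csv
import io

SAMPLE = '''file_name,file_size,site_name,user_dn,start_time,end_time,read_bytes,source,primds,procds,tier
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492023118,33899141,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
'''

# Columns assumed to hold integers when present
INT_FIELDS = ("file_size", "start_time", "end_time", "read_bytes")


def parse_records(text):
    """Parse CSV text into dicts, mapping "null" to None and numeric columns to int."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {key: (None if value == "null" else value) for key, value in raw.items()}
        for key in INT_FIELDS:
            if row[key] is not None:
                row[key] = int(row[key])
        rows.append(row)
    return rows


records = parse_records(SAMPLE)
print(records[0]["site_name"], records[0]["start_time"], records[0]["end_time"])
# T2_CH_CERN None 1492023118
```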