Data collection from multiple streams script

Introduction

The data_collection.py script collects data from multiple log streams (AAA, CMSSW, EOS, CRAB) and writes it out as sets of CSV files that all share the same predefined attribute names.

The script runs for one specific date. It looks up each stream's data for that date and uses it as input.

A separate directory is created automatically for each stream and each day, so records from different days and streams do not mix. For example: CMSSW/2016/09/15/. Output files are saved in CSV format (JSON output is also possible, but requires a small change to the code).
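The directory layout described above can be sketched as follows. This is a minimal illustration, not the script's own code: the function name output_dir is hypothetical, and the only things taken from this page are the <path>/Stream_name/Year/Month/Day layout and the YYYYMMDD date format.

```python
from datetime import datetime
import posixpath


def output_dir(fout, stream, date):
    """Return <fout>/<stream>/YYYY/MM/DD for a date given as YYYYMMDD.

    Hypothetical helper: the name and signature are assumptions made
    for this example.
    """
    # strptime alone accepts some shorter inputs, so check the shape first.
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must be in YYYYMMDD format: %r" % date)
    day = datetime.strptime(date, "%Y%m%d")  # rejects impossible dates such as 20161345
    return posixpath.join(
        fout, stream, day.strftime("%Y"), day.strftime("%m"), day.strftime("%d")
    )


cmssw_dir = output_dir("hdfs:///cms/users/username/streams", "CMSSW", "20160915")
print(cmssw_dir)
```

posixpath is used instead of os.path so that the separators stay as forward slashes regardless of the platform the example runs on.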

If the output directory already exists, it is deleted together with all files and directories inside it before new records are exported there.
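The overwrite behaviour can be illustrated on a local filesystem. This is only a sketch of the semantics: the helper name reset_output_dir is hypothetical, the real output location is on HDFS, where a different mechanism would be needed, and recreating the empty directory afterwards is a choice made for this example.

```python
import os
import shutil
import tempfile


def reset_output_dir(path):
    """Delete path and everything inside it, then recreate it empty.

    Hypothetical local-filesystem helper; it does not talk to HDFS.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)


# Demonstration in a throwaway temporary directory.
root = tempfile.mkdtemp()
target = os.path.join(root, "CMSSW", "2016", "09", "15")
os.makedirs(target)
open(os.path.join(target, "old.csv"), "w").close()  # leftover from an earlier run

reset_output_dir(target)
remaining = os.listdir(target)
print(remaining)

shutil.rmtree(root)  # clean up the temporary tree
```

Because earlier output is removed without confirmation, results that must be kept should be written under a different --fout root.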

Output record format

Ideally, all of these attributes should be present in every record:

file name
file size
primds
procds
tier
site name
file replicas
user dn
start/end time
read bytes
cpu/wc values
source: xrootd, eos, cmssw, crab

However, none of the four streams provides all of these attributes:

Attribute	AAA (Xrootd)	CMSSW	EOS	JobMonitoring (CRAB)
file name	+	+	+	+
file size	+	+	+	+
primds	+	+	+	+
procds	+	+	+	+
tier	+	+	+	+
site name		+		+
file replicas
user dn	+	+	+
start/end time	+	+	+*	+
read bytes	+	+
cpu/wc values				+

* Same timestamp is used for both start and end times
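One way to picture how records from different streams end up with the same attribute names is the sketch below. It is an assumption-laden illustration, not the script's implementation: the field list is copied from the CSV header in the output example on this page, the normalize function and the "timestamp" key of the raw EOS record are hypothetical, and attributes a stream does not provide are simply left as None.

```python
# Column names as they appear in the CSV header of the output example.
FIELDS = [
    "file_name", "file_size", "site_name", "user_dn", "start_time",
    "end_time", "read_bytes", "source", "primds", "procds", "tier",
]


def normalize(raw, source):
    """Map a raw stream record (a dict) onto the common attribute names.

    Hypothetical helper: attributes missing from the raw record become None.
    """
    row = {name: raw.get(name) for name in FIELDS}
    row["source"] = source
    # EOS has a single timestamp, which is used for both start and end time.
    # The key "timestamp" is an assumed name for that raw field.
    if source == "eos" and "timestamp" in raw:
        row["start_time"] = raw["timestamp"]
        row["end_time"] = raw["timestamp"]
    return row


eos_row = normalize({"file_name": "/store/a.root", "timestamp": 1492023118}, "eos")
print(eos_row)
```

Keeping every key present, even when its value is None, means each stream can be written with an identical header.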

Execution example

The script must be executed through the run_spark script. The user must provide a date (as YYYYMMDD) and an output directory. The yarn and verbose arguments are optional.

    run_spark data_collection.py --yarn --date 20160915 --fout hdfs:///cms/users/username/streams --verbose

Script arguments

--yarn - run the job on the analytics cluster via the YARN resource manager.
--date YYYYMMDD - data is collected for this exact day. The date is also used to build the output path. It must be in YYYYMMDD format (YYYY - year, MM - two-digit month, DD - two-digit day).
--fout <path> - output root directory path. Output (CSV) files are saved in the <path>/Stream_name/Year/Month/Day directory (e.g. CMSSW files are put in /cms/users/username/streams/CMSSW/2016/09/15/).
--verbose - if this flag is present, the script prints some diagnostic information.
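The arguments listed above could be parsed and validated roughly as follows. This is a sketch using the standard argparse module, not the script's actual option handling: only the four flag names and the YYYYMMDD requirement come from this page, and the parse_args function and help texts are assumptions.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse --yarn, --date, --fout and --verbose (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Collect data from log streams")
    parser.add_argument("--date", required=True, help="day to process, as YYYYMMDD")
    parser.add_argument("--fout", required=True, help="output root directory path")
    parser.add_argument("--yarn", action="store_true", help="run the job via YARN")
    parser.add_argument("--verbose", action="store_true", help="print diagnostics")
    args = parser.parse_args(argv)

    # Enforce exactly eight digits that also form a real calendar date.
    if len(args.date) != 8 or not args.date.isdigit():
        parser.error("--date must be in YYYYMMDD format")
    try:
        datetime.strptime(args.date, "%Y%m%d")
    except ValueError:
        parser.error("--date is not a valid calendar date")
    return args


args = parse_args([
    "--yarn", "--date", "20160915",
    "--fout", "hdfs:///cms/users/username/streams", "--verbose",
])
print(args.date, args.fout, args.yarn, args.verbose)
```

Validating the date up front avoids building an output path from a malformed value.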

Output example

First three records from CMSSW/2017/04/12 in CSV format:

file_name,file_size,site_name,user_dn,start_time,end_time,read_bytes,source,primds,procds,tier
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492023118,33899141,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492037450,34059668,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1491995280,34057033,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
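Records in this format can be read back with Python's standard csv module. The sketch below parses the header and the first record shown above from an in-memory string; the read_records function is hypothetical, and treating the literal text null as a missing value is an assumption based on the start_time column of the sample.

```python
import csv
import io

# Header and first record copied from the output example above.
SAMPLE = (
    "file_name,file_size,site_name,user_dn,start_time,end_time,"
    "read_bytes,source,primds,procds,tier\n"
    '"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/'
    '0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,'
    "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/"
    "CN=Robot: CMS Build,null,1492023118,33899141,cmssw,SingleElectron,"
    "Run2011B-WElectron-19Nov2011-v1,RAW-RECO\n"
)


def read_records(handle):
    """Return a list of dicts, one per CSV row, mapping 'null' to None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({key: (None if value == "null" else value)
                     for key, value in row.items()})
    return rows


records = read_records(io.StringIO(SAMPLE))
print(records[0]["site_name"], records[0]["end_time"], records[0]["read_bytes"])
```

All values are returned as strings; numeric columns such as read_bytes would need an explicit int() conversion before arithmetic.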
