Data collection from multiple streams script
The script `data_collection.py` collects data from multiple log streams (AAA, CMSSW, EOS, CRAB) and outputs them as sets of CSV files with predefined attribute names that are the same for all streams.
The script is run for a specific date: it looks up the streams for the provided date and uses them as input.
A directory is created automatically for each stream and each day (for example `CMSSW/2016/09/15/`), so records from different days and streams do not mix. Output files are saved in CSV format (JSON output is also possible, but requires a small change to the code).
If the output directory already exists, it and all files and directories inside it are deleted before new records are exported there.
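For illustration, the per-stream, per-day layout and the cleanup step could look like the sketch below on a local filesystem (the real script writes to HDFS; the function names here are hypothetical, not the script's actual API):

```python
import os
import shutil

def output_dir(fout, stream, date):
    """Build <fout>/<Stream>/<YYYY>/<MM>/<DD> from a YYYYMMDD date string."""
    year, month, day = date[:4], date[4:6], date[6:8]
    return os.path.join(fout, stream, year, month, day)

def prepare_output_dir(path):
    """Delete a pre-existing output directory so old and new records never mix."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)

# output_dir("/cms/users/username/streams", "CMSSW", "20160915")
# -> "/cms/users/username/streams/CMSSW/2016/09/15"
```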
Ideally, all of the following attributes would be present in every record:
- file name
- file size
- primds
- procds
- tier
- site name
- file replicas
- user dn
- start/end time
- read bytes
- cpu/wc values
- source: xrootd, eos, cmssw, crab
However, none of the four streams provides all of these attributes.
| Attribute      | AAA (Xrootd) | CMSSW | EOS | JobMonitoring (CRAB) |
|----------------|--------------|-------|-----|----------------------|
| file name      | +            | +     | +   | +                    |
| file size      | +            | +     | +   | +                    |
| primds         | +            | +     | +   | +                    |
| procds         | +            | +     | +   | +                    |
| tier           | +            | +     | +   | +                    |
| site name      | +            | +     |     |                      |
| file replicas  |              |       |     |                      |
| user dn        | +            | +     | +   |                      |
| start/end time | +            | +     | +*  | +                    |
| read bytes     | +            | +     |     |                      |
| cpu/wc values  | +            |       |     |                      |
\* Same timestamp is used for both start and end times
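Conceptually, each stream's records are projected onto one common attribute set, with attributes a stream does not provide left empty. A minimal sketch of that step is shown below; the column list mirrors the CSV header shown at the end of this page, while `unify` is a hypothetical helper, not the script's actual function:

```python
# Common attribute names, matching the CSV header shown at the bottom of this page.
COLUMNS = [
    "file_name", "file_size", "site_name", "user_dn",
    "start_time", "end_time", "read_bytes", "source",
    "primds", "procds", "tier",
]

def unify(record, source):
    """Project a stream-specific record (dict) onto the common attribute set.

    Attributes the stream does not provide stay None, so every stream yields
    rows with identical columns (illustrative sketch, not the script's code).
    """
    row = {col: record.get(col) for col in COLUMNS}
    row["source"] = source  # one of: xrootd, eos, cmssw, crab
    return row
```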
The script must be executed using the `run_spark` wrapper script. The user must provide a date (as `YYYYMMDD`) and an output directory; the `--yarn` and `--verbose` arguments are optional.
`run_spark data_agg.py --yarn --date 20160915 --fout hdfs:///cms/users/username/streams --verbose`
- `--yarn`: run the job on the analytics cluster via the YARN resource manager.
- `--date YYYYMMDD`: data will be aggregated for this exact day; the date is also used to create the output path. Note that the date must be in `YYYYMMDD` format (YYYY - year, MM - two-digit month, DD - two-digit day).
- `--fout <path>`: output root directory path. Output (CSV) files will be saved in the `<path>/Stream_name/Year/Month/Day` directory (e.g. CMSSW files will be put in `/cms/users/username/streams/CMSSW/2016/09/15/`).
- `--verbose`: if this flag is present, the script outputs some diagnostic information.
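For reference, the options above could be parsed with `argparse` roughly as follows; this is an illustrative sketch, not the actual option handling in `data_collection.py`:

```python
import argparse

# Illustrative option parsing for the flags described above.
parser = argparse.ArgumentParser(
    description="Collect AAA/CMSSW/EOS/CRAB records for one day")
parser.add_argument("--yarn", action="store_true",
                    help="run the job on the analytics cluster via the YARN resource manager")
parser.add_argument("--date", required=True,
                    help="day to aggregate, in YYYYMMDD format; also used to build the output path")
parser.add_argument("--fout", required=True,
                    help="output root directory; files go to <fout>/<Stream>/<YYYY>/<MM>/<DD>/")
parser.add_argument("--verbose", action="store_true",
                    help="print diagnostic information")
args = parser.parse_args()
```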
The first three records from `CMSSW/2017/04/12` in CSV format:
file_name,file_size,site_name,user_dn,start_time,end_time,read_bytes,source,primds,procds,tier
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492023118,33899141,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1492037450,34059668,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO
"/store/data/Run2011B/SingleElectron/RAW-RECO/WElectron-19Nov2011-v1/0000/02DB8323-191B-E111-A9C5-003048D4390A.root",0,T2_CH_CERN,/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsbuild/CN=545661/CN=Robot: CMS Build,null,1491995280,34057033,cmssw,SingleElectron,Run2011B-WElectron-19Nov2011-v1,RAW-RECO