Tag and probe analysis using Apache Spark.
This package uses Apache Spark clusters. More details on CERN's Apache Spark can be found here.
Important: If you want to use the CERN analytix cluster (which is much faster to start up than the k8s cluster), you need to request access to the cluster by following the instructions in the help document found here.
The following will produce a set of example efficiencies (assuming you can run on analytix):
git clone https://github.com/dntaylor/spark_tnp.git
cd spark_tnp
source env.sh
kinit
./tnp_fitter.py flatten muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example
./tnp_fitter.py fit muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example
./tnp_fitter.py prepare muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example
There are example notebooks in the notebooks directory demonstrating the use of these tools interactively. A good starting point is to follow the instructions in MuonTnP.ipynb. These notebooks use https://swan.cern.ch to connect to the Apache Spark clusters at CERN.
There are a couple of ways you can run: either connect to the hadoop edge node or run directly on lxplus.
The jobs are run on spark clusters and the data is read from an hdfs cluster. The default (and preferred) way is to use the analytix spark and hdfs cluster.
Connect to the hadoop edge node (from within the CERN network):
ssh it-hadoop-client
Setup the environment:
kinit
source /cvmfs/sft.cern.ch/lcg/views/LCG_97python3/x86_64-centos7-gcc8-opt/setup.sh
source hadoop-setconf.sh analytix
Connect to LXPLUS:
ssh lxplus.cern.ch
Setup the environment:
source env.sh
Note: Do not forget to make sure you have a valid kerberos token with:
kinit
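To quickly check that the environment actually provides Spark, you can start and stop a throwaway local session. This is only a sanity-check sketch; the app name is arbitrary:

```python
# Run with `python3` after sourcing the environment.
from pyspark.sql import SparkSession

# Create a small local session just to confirm pyspark is usable.
spark = SparkSession.builder.appName("env_check").getOrCreate()
print(spark.version)
spark.stop()
```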
Install the tqdm package for a nice progress bar:
pip install --user tqdm
The tag-and-probe process is broken down into several parts:
- Creation of the flat ROOT tag-and-probe trees (not shown here)
- Conversion of the ROOT TTree into the parquet data format
- Reduce the data into binned histograms with spark
- Fit the resulting histograms
- Extraction of efficiencies and scale factors
These steps are controlled with the tnp_fitter.py script. For help with the script run:
./tnp_fitter.py -h
The most important argument to pass is the configuration file that controls what kind of fits are produced. See detailed documentation in the configs directory.
New tag-and-probe datasets will need to be registered in the data directory.
The conversion to parquet vastly speeds up the later steps.
We will use laurelin to read the root files and then write them in the parquet data format. There are two possible approaches: using k8s and using analytix.
Conversion with k8s currently only works if you are using https://swan.cern.ch. Use the RootToParquet notebook as a guide. The output should be written to analytix.
Conversion with analytix requires you to first copy your root files to hdfs://analytix. There is an issue with reading root files from eos on analytix that needs to be understood.
The following should be executed when you are connected to the edge node.
hdfs dfs -cp root://eoscms.cern.ch//eos/cms/store/[path-to-files]/*.root hdfs://analytix/[path-to-out-dir]
Additionally, you will need to download the jar files to add to the spark executors:
bash setup.sh
Once copied, you can use:
./tnp_fitter.py convert [particle] [probe] [resonance] [era]
Note: this will currently raise a NotImplemented exception. You can look at converter.py for how to run things until it is incorporated.
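For reference, the conversion is conceptually just reading the flat ROOT trees with laurelin and writing them back out as parquet. The sketch below illustrates that idea only; the jar name, tree name, and input/output paths are illustrative assumptions, so see converter.py and the RootToParquet notebook for the real configuration:

```python
from pyspark.sql import SparkSession

# The laurelin jars fetched by setup.sh must be available to the driver and
# executors; the jar path here is an assumption.
spark = (SparkSession.builder
         .appName("root_to_parquet")
         .config("spark.jars", "jars/laurelin.jar")
         .getOrCreate())

# Read the flat tag-and-probe TTree with the laurelin data source
# (tree name and input path are illustrative).
df = (spark.read.format("root")
      .option("tree", "tpTree/fitter_tree")
      .load("hdfs://analytix/[path-to-out-dir]/*.root"))

# Write the same columns out as parquet for the later flatten step
# (output path illustrative).
df.write.parquet("hdfs://analytix/[path-to-parquet-dir]/tnp.parquet")

spark.stop()
```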
This step uses the converted parquet data format to efficiently aggregate the efficiency data into binned histograms.
./tnp_fitter.py flatten -h
For example, to flatten all histograms for the Run2017 Legacy muon scalefactors from Z:
./tnp_fitter.py flatten muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json
You can optionally filter the efficiencies and shifts you flatten with the --numerator, --denominator, and --shiftType arguments. Thus, to only flatten the nominal histograms do:
./tnp_fitter.py flatten muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json --shiftType Nominal
Note: running this on lxplus will give the following warnings:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
WARN TableMapping: /etc/hadoop/conf/topology.table.file cannot be read.
java.io.FileNotFoundException: /etc/hadoop/conf/topology.table.file (No such file or directory)
...
WARN TableMapping: Failed to read topology table. /default-rack will be used for all nodes.
and
WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
which can be safely ignored.
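Conceptually, the flatten step is a Spark aggregation: each probe is assigned to the configured kinematic bins and its weight is summed into passing and failing mass histograms. The sketch below only illustrates the idea; the column names, binning, and pass/fail flag are illustrative assumptions, while the real schema and binning are driven by the JSON configuration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tnp_flatten_sketch").getOrCreate()

# Hypothetical converted tag-and-probe parquet (path and columns illustrative).
df = spark.read.parquet("hdfs://analytix/[path-to-parquet-dir]/tnp.parquet")

# Assign each probe to a pt, eta, and mass bin, then sum the event weights
# per bin separately for probes passing and failing the numerator selection.
binned = (df
          .withColumn("pt_bin", F.floor(F.col("probe_pt") / 10.0))
          .withColumn("eta_bin", F.floor((F.col("probe_eta") + 2.4) / 0.4))
          .withColumn("mass_bin", F.floor((F.col("pair_mass") - 60.0) / 0.5))
          .groupBy("pt_bin", "eta_bin", "mass_bin", "passing")
          .agg(F.sum("weight").alias("sum_w")))

# The small aggregated table is what becomes the binned mass histograms
# that are fit in the next step.
binned.show()
```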
Histogram fitting uses local running or condor.
Note: the first time you run a fit it will compile the ROOT classes. Don't use the -j option the first time you run fits; it will try to compile the modules multiple times and throw errors. Instead, run a single fit on one core, then ctrl-c and rerun with the -j option (it won't compile again).
To run locally (with 16 threads):
./tnp_fitter.py fit muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json -j 16
To submit to condor:
./tnp_fitter.py fit muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json --condor
condor_submit condor.sub
The histograms which are fit can be controlled with optional filters. See documentation with:
./tnp_fitter.py fit -h
There is a simple automatic recovery procedure that can be run (in case of condor failures). More advanced options (such as using statistical tests to evaluate the goodness of fit) are still being implemented.
./tnp_fitter.py fit muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json -j 16 --recover
Plots and scalefactors can then be extracted with:
./tnp_fitter.py prepare muon generalTracks Z Run2017_UL configs/muon_pog_official_run2_Z_2017.json --condor
Note: this is still a WIP.
The make_pileup.py script produces the pileup distribution in MC. This part requires a CMSSW environment to be sourced.
To make the data pileup, copy the latest PileupHistogram from:
/afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/{COLLISION_ERA}/{ENERGY}/PileUp/PileupHistogram-{...}.root
You should grab the 69200ub version. If you wish to explore systematic uncertainties in the choice of the minbias cross section, use the down (66000ub) and up (72400ub) histograms.
Alternatively, you can make it yourself with (e.g. Run2017):
lumimask=/afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions17/13TeV/ReReco/Cert_294927-306462_13TeV_EOY2017ReReco_Collisions17_JSON.txt
pileupjson=/afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions17/13TeV/PileUp/pileup_latest.txt
xsec=69200
maxBins=100
pileupCalc.py -i $lumimask --inputLumiJSON $pileupjson --calcMode true --minBiasXsec $xsec --maxPileupBin $maxBins --numPileupBins $maxBins pileup/data/Run2017.root
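If you then need per-event pileup weights, the standard recipe is the normalized data/MC ratio of the two distributions. Below is a minimal PyROOT sketch; the file paths and the MC histogram name are assumptions (pileupCalc.py writes its histogram as "pileup" by default):

```python
import ROOT

# Data pileup from pileupCalc.py and MC pileup from make_pileup.py;
# the paths and the MC histogram name are assumptions for this sketch.
fdata = ROOT.TFile.Open("pileup/data/Run2017.root")
fmc = ROOT.TFile.Open("pileup/mc/Run2017.root")
hdata = fdata.Get("pileup")
hmc = fmc.Get("pileup")

# Normalize both to unit area and take the bin-by-bin data/MC ratio.
hdata.Scale(1.0 / hdata.Integral())
hmc.Scale(1.0 / hmc.Integral())
hweight = hdata.Clone("pileup_weight")
hweight.Divide(hmc)

# Repeat with the 66000ub and 72400ub data histograms for the
# minbias cross section systematic variations.
```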