Home
Welcome to the `mttbar` analysis documentation. This is the central analysis repository for heavy resonance searches in the top-antitop semileptonic final state using the columnflow framework.

In the following, a quick introduction to setting up the `mttbar` analysis is given.
Before we begin, make sure you are in a clean shell session, i.e. you did not load any software stack, and especially not CMSSW! Things will break if you try to install the software inside CMSSW. Also check that you are not loading any software in your `.bashrc`, `.profile` or similar.
Also make sure you have a valid grid proxy! You might not be able to access the input datasets otherwise. Typically, a grid proxy can be obtained by using the `voms-proxy-init` command, but the exact procedure varies by site. Consult your colleagues/site admins for more information.
Once you have made sure no other software is loaded, clone the repository and install the software as follows:
git clone --recursive https://github.cern.ch/uhh-cms/mttbar.git
cd mttbar
source setup.sh dev
# ... answer questions (only during first setup)
Then, every time you log in, you need to execute the following to load the software:
source setup.sh dev
Now you are ready to run your first analysis task.
Tasks are the main unit of work in columnflow-based analyses. Complex analysis workflows are organized in a hierarchy of tasks that typically depend on the output of previous tasks, referred to as their requirements or dependencies. Each task also defines one or more output files, which typically serve as inputs to any tasks that depend on it. When all outputs declared by a task are present, that task is considered complete and will not be run again by the framework. The full columnflow task graph can be found here.
Running a task can be done from the command line by invoking the `law run` command, followed by the task name (law is the workflow management tool used by columnflow under the hood). Before a task is run, the framework checks if all requirements for that task are complete (i.e. if all outputs of dependent tasks are present). If a dependency output is not present, the corresponding task will be scheduled to run automatically before running the main task. This is done for all tasks in the hierarchy, until all intermediate outputs have been produced.
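To make the requirement/output mechanism a bit more concrete, here is a generic, much simplified sketch of how a law task can declare its dependencies and outputs. These are toy classes written for illustration only, not actual mttbar or columnflow tasks:

```python
import law


class TaskA(law.Task):
    """Toy task with no requirements that writes a single output file."""

    def output(self):
        # TaskA counts as complete once this target exists
        return law.LocalFileTarget("task_a.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("output of TaskA\n")


class TaskB(law.Task):
    """Toy task that depends on TaskA."""

    def requires(self):
        # if TaskA's output is missing, TaskA is scheduled automatically
        return TaskA.req(self)

    def output(self):
        # TaskB's own output, which downstream tasks could consume in turn
        return law.LocalFileTarget("task_b.txt")

    def run(self):
        # read the dependency's output and write this task's output
        with self.input().open("r") as f:
            data = f.read()
        with self.output().open("w") as f:
            f.write(data + "processed by TaskB\n")
```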
To see a list of available tasks, run `law index --verbose`.
Tasks have many command-line parameters that can be set to control their behavior. For example, the `--dataset` parameter defines which dataset to run over, `--selector` defines the selector used for filtering events and/or physics object collections, and `--calibrators` and `--producers` control which object calibrations to apply and which additional column producers to run. The full set of available parameters depends on the task. Passing `--help` to the `law run <task_name>` command will provide a list of the parameters supported by that task.
Let's run an example task now. A standard task in any analysis involves selecting events and/or filtering physics objects inside events based on a set of criteria. In columnflow, these criteria are defined via so-called selectors and can be applied by running a task called `cf.SelectEvents`. This task needs a `--dataset` parameter so it knows what data to run the selection on, and a `--version` parameter, which is a string that represents the current state of the analysis at a given point in time. The version string can be freely chosen, although it is recommended to use a sequential pattern like `v0`, `v1`, etc.

To run the selection for the ttbar dileptonic MC dataset, run the following command:
law run cf.SelectEvents --version v0 --dataset tt_dl_powheg --branch 0
The `--branch 0` parameter tells the framework to run only on the first file of the dataset, which is useful for testing. When running this task, you should notice that a series of other tasks are also scheduled. One is `cf.CalibrateEvents`, which is a direct dependency of `cf.SelectEvents` and is responsible for applying object calibrations (e.g. jet energy corrections), which typically needs to be done before any selections are applied. This task further depends on `cf.GetDatasetLFNs`, which queries the CMS Data Aggregation System (DAS) to get the actual list of filenames belonging to the dataset (LFN stands for logical file name).
Once all the scheduled tasks have been run, we can go ahead and inspect the outputs.
Task outputs are normally located under `data/mtt_store/analysis_mtt/`.
An output directory is created here for each task, and each directory contains the
outputs of that task, organized in a series of nested subdirectories that reflect
the parameters with which the task was called.
Tasks can produce different types of files. Most commonly these will be either `parquet` files (which contain event information organized in an AwkwardArray) or `pickle` files (which contain histograms or other Python objects). Some tasks also output structured information stored in a `json` file, and plotting tasks produce either `pdf` or `png` files.
To check the output path of a task, add `--print-output 0` to the task call:
$> law run cf.SelectEvents --version v0 --dataset tt_dl_powheg --branch 0 --print-output 0
print task output with max_depth 0, showing schemes
file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/results_0.parquet
file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/stats_0.json
file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/columns_0.parquet
As can be seen, the task has produced two `parquet` files. One of these (`results.parquet`) contains the selection results, and the other (`columns.parquet`) contains additional data columns produced during the selection. A `json` file with some selection statistics is also produced.
To see the contents of these files, a useful tool is the `cf_inspect` script, which can be used to inspect the outputs interactively in an IPython session.

Let's try this out on one of the `parquet` files. We'll use the environment variable `$CF_STORE_LOCAL`, which points by default to the `data/mtt_store` directory:
cf_inspect $CF_STORE_LOCAL/analysis_mtt/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/{results,columns}_0.parquet
IPython will open and load the contents of each `parquet` file into a local variable called `objects`. In the example above, we loaded the AwkwardArrays contained in two files, `results.parquet` and `columns.parquet`. We can unpack the `objects` variable into two separate variables for convenience and, for example, look at the `fields` property to see what columns the arrays contain.
In [1]: results, columns = objects
In [2]: columns.fields
Out[2]:
['category_ids',
'cutflow',
'event',
'luminosityBlock',
'mc_weight',
'process_id',
'run']
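While you are in the session, you can also do some quick checks on the arrays; here is a small sketch using two of the fields listed above (`event` and `mc_weight`):

```python
# still inside the cf_inspect IPython session
import awkward as ak

# peek at the first few event numbers
columns.event[:5]

# average MC event weight of the events in this file
ak.mean(columns.mc_weight)
```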
Experiment a bit, then close the IPython session when you're done using `Ctrl+D`.
To create histograms, call the `cf.MergeHistograms` task. You need to specify the variable(s) to create the histogram(s) for. As an example, we will use `jet1_eta` here, which represents the pseudorapidity of the jet with the highest transverse momentum:
law run cf.MergeHistograms --config run2_2017_nano_v9_limited --version v0 --dataset tt_dl_powheg --variables jet1_eta
We can now look at the Python histograms using `cf_inspect` like above. First we get the output location of the `pickle` files where the histograms are stored:
$> law run cf.MergeHistograms --config run2_2017_nano_v9_limited --version v0 --dataset tt_dl_powheg --variables jet1_eta --print-output 0
print task output with max_depth 0, showing schemes
file:///<path_to_storage>/cf.MergeHistograms/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/prod__default/v0/hist__jet1_eta.pickle
We can open the histogram files using `cf_inspect`:
cf_inspect $CF_STORE_LOCAL/analysis_mtt/cf.MergeHistograms/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/prod__default/v0/hist__jet1_eta.pickle
Then in the IPython shell, get the histogram object from `objects[0]` and inspect it:
In [1]: h = objects[0]
In [2]: h
Out[2]:
Hist(
IntCategory([0, 1.2478e+09, 1.05419e+09, 1.1463e+09], growth=True, name='category'),
IntCategory([1200], growth=True, name='process'),
IntCategory([0], growth=True, name='shift'),
Variable([-2.5, -1.57, -1.44, -1.3, -0.7, 0, 0.7, 1.3, 1.44, 1.57, 2.5], name='jet1_eta', label='Jet 1 $\\eta$'),
storage=Weight()) # Sum: WeightedSum(value=146.288, variance=563.165)
As you can see, this is a four-dimensional histogram. It stores not only the `variable` information, but also information about the `category`, `process` and `shift`. Let's ignore the latter three axes for the moment and project them out so we get a regular one-dimensional histogram. This can be done with a special syntax, for example like this:
h_1d = h[{"category": hist.loc(0), "process": 0, "shift": 0}]
You can now display the histogram with `h_1d.show()`. For more information on Python histograms, check out the `hist` package documentation at https://hist.readthedocs.io/en/latest/user-guide/quickstart.html.
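As an alternative to picking single bins on the auxiliary axes as above, they can also be summed over entirely; a short sketch using the `project` method of the `hist` package (the axis name is taken from the histogram printed above):

```python
# sum over the category, process and shift axes, keeping only the variable axis
h_proj = h.project("jet1_eta")
h_proj.show()  # text-based preview in the terminal
```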
When you are done experimenting, quit the IPython shell with `Ctrl+D`.
Finally, we can also produce a plot of the histogram using `cf.PlotVariables1D`:
law run cf.PlotVariables1D --config run2_2017_nano_v9_limited --version v0 --datasets tt_dl_powheg --variables jet1_eta
From here, there are a few things you can try:
- plot other variables (try adding the leading jet pT, i.e. `--variables jet1_eta,jet1_pt`)
- use more datasets (try including a data sample, i.e. `--datasets tt_dl_powheg,data_mu_f`)
- use the batch system (add `--workflow htcondor` to the above tasks)
- run over full statistics (remove the `_limited` suffix from the config, i.e. `--config run2_2017_nano_v9`)
  - note: it only makes sense to use this together with `--workflow htcondor`, otherwise everything will run locally and will take a very long time
Useful links:
- columnflow: main framework repository
- law: workflow management system, built on top of luigi
- luigi: base package for task specification and dependency management
- order: Pythonic tools for organization of analysis metadata
- awkward: numpy-like array objects for nested, variable-sized data
- hist: Python library for working with histograms
- Source hosted at GitHub
- Report issues, questions, feature requests on GitHub Issues