Welcome to the mttbar analysis documentation. This is the central analysis repository for heavy resonance searches in the top-antitop semileptonic final state using the columnflow framework.

In the following, a quick introduction to setting up the mttbar analysis is given.

Preparations

Before we begin, make sure you are in a clean shell session, i.e. one in which no software stack has been loaded, and especially not CMSSW! Things will break if you try to install the software inside a CMSSW environment. Also check that you are not loading any software automatically in your .bashrc, .profile or similar.

Also make sure you have a valid grid proxy! You might not be able to access the input datasets otherwise. Typically, a grid proxy can be obtained using the voms-proxy-init command, but the exact procedure varies by site; consult your colleagues or site admins for more information.

Installing the software

Once you have made sure no other software is loaded, clone the repository and install the software as follows:

git clone --recursive https://github.cern.ch/uhh-cms/mttbar.git
cd mttbar
source setup.sh dev
# ... answer questions (only during first setup)

Then, every time you log in, you need to execute the following to load the software:

source setup.sh dev

As noted under Preparations, a valid grid proxy is required to access the datasets.

Now you are ready to run your first analysis task.

Running tasks

Tasks are the main unit of work in columnflow-based analyses. Complex analysis workflows are organized in a hierarchy of tasks that typically depend on the output of previous tasks, referred to as their requirements or dependencies. Each task also defines one or more output files, which typically serve as inputs to any tasks that depend on it. When all outputs declared by a task are present, that task is considered complete and will not be run again by the framework. The full columnflow task graph can be found here.

Running a task can be done from the command line by invoking the law run command, followed by the task name (law is the workflow management tool used by columnflow under the hood). Before a task is run, the framework checks if all requirements for that task are complete (i.e. if all outputs of dependent tasks are present). If a dependency output is not present, the corresponding task will be scheduled to run automatically before running the main task. This is done for all tasks in the hierarchy, until all intermediate outputs have been produced.

To see a list of available tasks, run law index --verbose.

Tasks have many command-line parameters that can be set to control their behavior. For example, the --dataset parameter defines which dataset to run over, --selector defines the selector used for filtering events and/or physics object collections, and --calibrators and --producers control which object calibrations to apply and which additional column producers to run. The full set of available parameters depends on the task; passing --help to the law run <task_name> command prints the list of parameters supported by that task.

Let's run an example task now. A standard step in any analysis is selecting events and/or filtering physics objects inside events based on a set of criteria. In columnflow, these criteria are defined via so-called selectors and can be applied by running a task called cf.SelectEvents. This task needs a --dataset parameter so it knows what data to run the selection on, and a --version parameter, which is a string that represents the current state of the analysis at a given point in time. The version string can be freely chosen, although it is recommended to use a sequential pattern like v0, v1, etc.

To run the selection for the ttbar dileptonic MC dataset, run the following command:

law run cf.SelectEvents --version v0 --dataset tt_dl_powheg --branch 0

The --branch 0 parameter tells the framework to run only on the first file of the dataset, which is useful for testing. When running this task, you should notice that a series of other tasks are also scheduled. One is cf.CalibrateEvents, which is a direct dependency of cf.SelectEvents and is responsible for applying object calibrations (e.g. jet energy corrections), which typically needs to be done before any selections are applied. This task further depends on cf.GetDatasetLFNs, which queries the CMS Data Aggregation System (DAS) to get the actual list of file names belonging to the dataset (LFN stands for logical file name).

Once all the scheduled tasks have been run, we can go ahead and inspect the outputs.

Inspecting task outputs

Task outputs are normally located under data/mtt_store/analysis_mtt/. An output directory is created here for each task, and each directory contains the outputs of that task, organized in a series of nested subdirectories that reflect the parameters with which the task was called.

Tasks can produce different types of files. Most commonly these will be either parquet files (which contain event information organized in an AwkwardArray) or pickle files (which contain histograms or other Python objects). Some tasks also output structured information stored in a json file, and plotting tasks produce either pdf or png files.
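
If you want to open such files directly (outside the helper script introduced below), a minimal sketch using the standard library and awkward; the file names are illustrative placeholders corresponding to the outputs shown further down:

import json
import pickle

import awkward as ak

# parquet: event-level columns as an awkward array
events = ak.from_parquet("columns_0.parquet")  # illustrative file name
print(events.fields)

# pickle: histograms or other Python objects
with open("hist__jet1_eta.pickle", "rb") as f:  # illustrative file name
    hist_obj = pickle.load(f)

# json: structured information, e.g. selection statistics
with open("stats_0.json") as f:  # illustrative file name
    stats = json.load(f)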

To check the output path of a task, add --print-output 0 to the task call:

$> law run cf.SelectEvents --version v0 --dataset tt_dl_powheg --branch 0 --print-output 0

print task output with max_depth 0, showing schemes

file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/results_0.parquet
file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/stats_0.json
file:///<path_to_storage>/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/columns_0.parquet

As can be seen, the task has produced two parquet files. One of these (results.parquet) contains the selection results, and the other (columns.parquet) contains additional data columns produced during the selection. A json file with some selection statistics is also produced.

To see the contents of these files, a useful tool is the script cf_inspect, which opens the outputs interactively in an IPython session. Let's try it out on the two parquet files we just produced. We'll use the environment variable $CF_STORE_LOCAL, which points by default to the data/mtt_store directory:

cf_inspect $CF_STORE_LOCAL/analysis_mtt/cf.SelectEvents/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/v0/{results,columns}_0.parquet

IPython will open with the contents of the given files loaded into a local variable called objects (one entry per file). In the example above, we loaded the awkward arrays contained in two files, results.parquet and columns.parquet. We can unpack the objects variable into two separate variables for convenience and, for example, look at the fields property to see which columns the arrays contain.

In [1]: results, columns = objects

In [2]: columns.fields
Out[2]:
['category_ids',
 'cutflow',
 'event',
 'luminosityBlock',
 'mc_weight',
 'process_id',
 'run']
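
A few more things to try from here (a sketch; awkward should be importable since columnflow depends on it, and the results.event mask mentioned below is an assumption to be checked against results.fields):

import awkward as ak   # should be available in the analysis environment

len(columns)           # number of events in this file
columns.mc_weight[:5]  # first few MC event weights
results.fields         # see what the selection results contain
# if results exposes a per-event boolean mask (assumed name: "event"),
# the number of selected events would be:
# ak.sum(results.event)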

Experiment a bit, then close the IPython session when you're done using Ctrl + D.

Histograms

To create histograms, call the cf.MergeHistograms task. You need to specify the variable(s) to create the histogram(s) for. As an example, we will use jet1_eta here, which represents the pseudorapidity of the jet with the highest transverse momentum:

law run cf.MergeHistograms --config run2_2017_nano_v9_limited --version v0 --dataset tt_dl_powheg --variables jet1_eta

We can now look at the Python histograms using cf_inspect like above. First we get the output location of the pickle files where the histograms are stored:

$> law run cf.MergeHistograms --config run2_2017_nano_v9_limited --version v0 --dataset tt_dl_powheg --variables jet1_eta --print-output 0

print task output with max_depth 0, showing schemes

file:///<path_to_storage>/cf.MergeHistograms/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/prod__default/v0/hist__jet1_eta.pickle

We can open the histogram files using cf_inspect:

cf_inspect $CF_STORE_LOCAL/analysis_mtt/cf.MergeHistograms/run2_2017_nano_v9_limited/tt_dl_powheg/nominal/calib__default/sel__default/prod__default/v0/hist__jet1_eta.pickle

Then in the IPython shell, get the histogram object from objects[0] and inspect it:

In [1]: h = objects[0]

In [2]: h
Out[2]:
Hist(
  IntCategory([0, 1.2478e+09, 1.05419e+09, 1.1463e+09], growth=True, name='category'),
  IntCategory([1200], growth=True, name='process'),
  IntCategory([0], growth=True, name='shift'),
  Variable([-2.5, -1.57, -1.44, -1.3, -0.7, 0, 0.7, 1.3, 1.44, 1.57, 2.5], name='jet1_eta', label='Jet 1 $\\eta$'),
  storage=Weight()) # Sum: WeightedSum(value=146.288, variance=563.165)

As you can see, this is a four-dimensional histogram: in addition to the variable axis, it has axes for the category, the process and the systematic shift. Let's ignore those three axes for the moment and reduce the histogram to a regular one-dimensional one by selecting a single bin along each of them. This can be done with a special indexing syntax, for example like this:

import hist  # if not already imported in the session
h_1d = h[{"category": hist.loc(0), "process": 0, "shift": 0}]  # keep only the variable axis

You can now display the histogram with h_1d.show(). For more information on Python histograms, check out the hist package documentation at: https://hist.readthedocs.io/en/latest/user-guide/quickstart.html
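
Besides selecting single bins, the same indexing syntax accepts the built-in sum to integrate over an axis, and the bin contents can be extracted as arrays; a short sketch using standard hist functionality:

h.axes.name  # names of the four axes
# integrate over the category, process and shift axes instead of picking single bins
h_summed = h[{"category": sum, "process": sum, "shift": sum}]
h_summed.values()     # bin contents as a numpy array
h_summed.variances()  # per-bin variances (from the Weight storage)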

When you are done experimenting, quit the IPython shell with Ctrl + D.

Plotting

Finally, we can also produce a plot of the histogram using cf.PlotVariables1D:

law run cf.PlotVariables1D --config run2_2017_nano_v9_limited --version v0 --datasets tt_dl_powheg --variables jet1_eta

Other things to try

  • plot other variables (try adding the leading jet pT, i.e. --variables jet1_eta,jet1_pt)
  • use more datasets (try including a data sample, i.e. --datasets tt_dl_powheg,data_mu_f)
  • use the batch system (add --workflow htcondor to the above tasks)
  • run over full statistics (remove the _limited from the config, i.e. --config run2_2017_nano_v9)
    • note: it only makes sense to use this together with --workflow htcondor, otherwise everything will run locally and will take a very long time

Resources

  • columnflow: main framework repository
  • law: workflow management system, built on top of luigi
  • luigi: base package for task specification and dependency management
  • order: Pythonic tools for organization of analysis metadata
  • awkward: numpy-like array objects for nested, variable-sized data
  • hist: Python library for working with histograms

Development