Framework based on RDataFrames to calculate fake factors for tau-related analyses.

TauFakeFactors

FakeFactor framework for the estimation of jets misidentified as taus, using PyROOT.

Setup

Clone the repository via

git clone --recurse-submodules https://github.com/KIT-CMS/TauFakeFactors.git

The environment can be set up with conda via

conda env create --file environment.yaml

General definitions that are shared by all steps of the fake factor measurement, such as paths, should be defined in the configs/ANALYSIS/ERA/common_settings.yaml file.

The expected ntuple folder structure is NTUPLE_PATH/ERA/SAMPLE_TAG/CHANNEL/*.root

| parameter | type | description |
|---|---|---|
| ntuple_path | string | absolute path to the folder with the n-tuples on the dCache; a remote path is expected, e.g. "root://cmsxrootd-kit.gridka.de//store/user/USER/..." |
| tree | string | name of the tree in the n-tuple files ("ntuple" in CROWN) |
| era | string | data-taking era (e.g. "2018", "2017", "2016preVFP", "2016postVFP") |
| tau_vs_jet_wps | list | tau ID vsJet working points to be written out in the preselection step (e.g. ["Medium", "VVVLoose"]) |
| tau_vs_jet_wgt_wps | list | tau ID vsJet working point scale factors to be written out in the preselection step (e.g. ["Medium"]) |

The output folder structure is OUTPUT_PATH/preselection/ERA/CHANNEL/*.root

| parameter | type | description |
|---|---|---|
| output_path | string | absolute path where the files with the preselected events will be stored; a local path is expected, e.g. "/ceph/USER/..." |
| file_path | string | absolute path to the folder with the preselected files (should be the same as output_path), used for the fake factor calculation |
| workdir_name | string | relative path where the fake factor measurement output files will be stored; the folder is created in workdir/ |
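The two folder structures above can be sketched as small path helpers. This is only an illustration of the documented layout, not code from the framework: the helper names and all values in `common_settings` are hypothetical, and only the key names are taken from the tables.

```python
# Hypothetical settings; only the key names follow common_settings.yaml.
common_settings = {
    "ntuple_path": "root://cmsxrootd-kit.gridka.de//store/user/USER/ntuples",
    "era": "2018",
    "output_path": "/ceph/USER/ff_preselection",
    "workdir_name": "my_ff_measurement",
}


def ntuple_glob(settings, sample_tag, channel):
    """Input pattern: NTUPLE_PATH/ERA/SAMPLE_TAG/CHANNEL/*.root"""
    # Plain string joining keeps the double slash of the remote xrootd path.
    base = settings["ntuple_path"].rstrip("/")
    return "/".join([base, settings["era"], sample_tag, channel, "*.root"])


def preselection_glob(settings, channel):
    """Output pattern: OUTPUT_PATH/preselection/ERA/CHANNEL/*.root"""
    base = settings["output_path"].rstrip("/")
    return "/".join([base, "preselection", settings["era"], channel, "*.root"])


print(ntuple_glob(common_settings, "SAMPLE_TAG", "mt"))
print(preselection_glob(common_settings, "mt"))
```

Printing the patterns before a run is a cheap way to spot a wrong era or channel in the paths.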

Event preselection

This framework is designed for n-tuples produced with CROWN as input. All information for the preselection step is defined in configuration files in the configs/ANALYSIS/ERA/ folder.

The preselection config has the following parameters:

  • The decay channel:

    | parameter | type | description |
    |---|---|---|
    | channel | string | tau pair decay channel ("et", "mt", "tt") |
  • processes defines all processes that should be preprocessed.
    The process names are also used to name the output files.
    Each process needs two specifications:

    | subparameter | type | description |
    |---|---|---|
    | tau_gen_modes | list | split of the events according to the origin of the hadronic tau |
    | samples | list | all sample tags that belong to the process |

    The following tau_gen_modes are available:

    | mode | type | description |
    |---|---|---|
    | T | string | genuine tau |
    | J | string | jet misidentified as a tau |
    | L | string | lepton misidentified as a tau |
    | all | string | no split is performed |
  • event_selection defines all selections that should be applied.
    It is a dictionary of cuts where the key is the name of a cut and the value is the cut itself as a string, e.g. had_tau_pt: "pt_2 > 30". The cut name is only used for the terminal output. A cut can only use variables that are present in the n-tuples.

  • mc_weights defines all weights that should be applied to simulated samples.
    There are two types of weights:

    1. Similar to event_selection, a weight can be specified directly and is then applied to all samples in the same way, e.g. lep_id: "id_wgt_mu_1".
    2. Some weights are either sample-specific or need additional information. The currently implemented options are:

      | subparameter | type | description |
      |---|---|---|
      | generator | string | the normal generator weight is applied to all samples that are not specified in the "stitching" sub-group; stitching weights might be needed for DY+jets or W+jets, depending on which samples are used for them |
      | lumi | string | luminosity scaling; the weight depends on the era and is chosen from the era parameter of the config, so the content of the string is irrelevant |
      | Z_pt_reweight | string | reweighting of the Z boson pt; the weight from the n-tuple is used and only applied to DY+jets |
      | Top_pt_reweight | string | reweighting of the top quark pt; the weight from the n-tuple is used and only applied to ttbar |
  • emb_weights defines all weights that should be applied to embedded samples.
    As for event_selection, a weight can be specified directly and is then applied to all samples in the same way, e.g. single_trigger: "trg_wgt_single_mu24ormu27".

  • output_features lists the features that are saved because they are needed for the later calculations.
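The parameters above can be pictured as one nested dictionary, here written in Python instead of YAML. This is a minimal sketch under stated assumptions: the process names, sample tags and feature names are hypothetical placeholders, and the helper functions are not part of the framework. They only illustrate how named cut strings and directly specified weights could be turned into expressions of the kind an RDataFrame `Filter` or `Define` accepts.

```python
# Hypothetical preselection config; key names follow the parameter list above.
preselection_config = {
    "channel": "mt",
    "processes": {
        "DYjets": {"tau_gen_modes": ["T", "J", "L"], "samples": ["SAMPLE_TAG_A"]},
        "data": {"tau_gen_modes": ["all"], "samples": ["SAMPLE_TAG_B"]},
    },
    "event_selection": {"had_tau_pt": "pt_2 > 30"},
    "mc_weights": {"lep_id": "id_wgt_mu_1", "lumi": ""},
    "emb_weights": {"single_trigger": "trg_wgt_single_mu24ormu27"},
    "output_features": ["pt_1", "pt_2"],
}

# Weight names that need special handling according to the table above.
SPECIAL_WEIGHTS = {"generator", "lumi", "Z_pt_reweight", "Top_pt_reweight"}


def combined_cut(event_selection):
    """Join all cut strings into a single logical AND expression."""
    return " && ".join(f"({cut})" for cut in event_selection.values())


def direct_weights(weights):
    """Keep only the weights that are given directly as n-tuple expressions."""
    return {k: v for k, v in weights.items() if k not in SPECIAL_WEIGHTS}


def weight_expression(weights):
    """Multiply all weight expressions; fall back to 1.0 without weights."""
    return "*".join(f"({w})" for w in weights.values()) or "1.0"


for name, cut in preselection_config["event_selection"].items():
    print(f"cut '{name}': {cut}")  # the name is only used for reporting
print(combined_cut(preselection_config["event_selection"]))
print(weight_expression(direct_weights(preselection_config["mc_weights"])))
```

Whether the framework applies the cuts one by one or as a single combined filter is not stated here; the combined form is shown only because it is compact.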

Scale factors for b-tagging and tau ID vs jet are applied on the fly during the FF calculation step.

To run the preselection step, execute the python script and specify the config file (relative path possible):

python preselection.py --config-file configs/PATH/CONFIG.yaml

Two additional optional parameters are available:

  1. --nthreads=SOME_INTEGER to define the number of threads for the multiprocessing pool to run the sample processing in parallel. Default value is 8 (this should normally cover running all of the samples in parallel).
  2. --ncores=SOME_INTEGER to define the number of cores that should be used for each pool thread to speed up the ROOT dataframe calculation. Default value is 4.
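The command line options can be mirrored with a small argparse sketch. It is not the parser used in preselection.py; only the flag names and the default values (8 and 4) are taken from the text above, while the help texts and the `build_parser` helper are assumptions.

```python
import argparse


def build_parser():
    """Illustrative parser with the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Preselection options (sketch)")
    parser.add_argument("--config-file", help="path to the preselection config")
    parser.add_argument("--nthreads", type=int, default=8,
                        help="number of pool workers, one sample per worker")
    parser.add_argument("--ncores", type=int, default=4,
                        help="cores used by each worker for the dataframe")
    return parser


args = build_parser().parse_args(
    ["--config-file", "configs/PATH/CONFIG.yaml", "--nthreads=2"]
)
# With several workers each using several cores, the job can occupy up to
# nthreads * ncores cores in total.
print(args.nthreads * args.ncores)
```

With the defaults this product is 32, so lowering one of the two values is worth considering on a shared machine.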

Fake Factor calculation

In this step the fake factors are calculated. This should be run after the preselection step.

All information for the FF calculation step is defined in a configuration file in the configs/ folder.
The FF calculation config has the following parameters:

  • The expected input folder structure is FILE_PATH/preselection/ERA/CHANNEL/*.root

    | parameter | type | description |
    |---|---|---|
    | file_path | string | absolute path to the folder with the preselected files |
    | era | string | data-taking era ("2018", "2017", "2016preVFP", "2016postVFP") |
    | channel | string | tau pair decay channel ("et", "mt", "tt") |
    | tree | string | name of the tree in the preselected files (same as in the preselection, e.g. "ntuple") |
  • The output folder structure is workdir/WORKDIR_NAME/ERA/fake_factors/CHANNEL/outputfiles

    | parameter | type | description |
    |---|---|---|
    | workdir_name | string | relative path where the output files will be stored |
  • General options for the calculation:

    | parameter | type | description |
    |---|---|---|
    | use_embedding | bool | True if embedded samples should be used, False if only MC samples should be used |
  • In target_processes the processes for which FFs should be calculated (normally for QCD, Wjets, ttbar) are defined.
    Each target process needs some specifications:

    | parameter | type | description |
    |---|---|---|
    | split_categories | dict | names of variables for the fake factor measurement in different phase space regions; the FF measurement can be split based on variables in 1D or 2D (1 or 2 variables); each category/variable has a list of orthogonal cuts (e.g. "njets" with "==1", ">=2"); implemented split variables are "njets", "nbtag" or "deltaR_ditaupair"; at least one inclusive category needs to be specified |
    | split_categories_binedges | dict | bin edge values for each split_categories variable; the number of bin edges should always be N(variable cuts)+1 |
    | SRlike_cuts | dict | event selections for the signal-like region of the target process |
    | ARlike_cuts | dict | event selections for the application-like region of the target process |
    | SR_cuts | dict | event selections for the signal region (normally only needed for ttbar) |
    | AR_cuts | dict | event selections for the application region (normally only needed for ttbar) |
    | var_dependence | string | variable the FF measurement should depend on (normally the pt of the hadronic tau, e.g. "pt_2") |
    | var_bins | list | bin edges for the variable specified in var_dependence |

    Event selections can be defined in the same way as event_selection in the preselection step. Only the tau vs jet ID cut is special: its name must always be had_tau_id_vs_jet (or had_tau_id_vs_jet_* in the tt channel), because the working points are read from the cut string in order to apply the correct tau vs jet ID weights.

  • In process_fractions specifications for the calculation of the process fractions are defined.

    | parameter | type | description |
    |---|---|---|
    | processes | list | sample names (from the preprocessing step) of the processes for which the fractions should be stored in the correctionlib JSON; the fractions of the specified samples sum to 1 |
    | split_categories | dict | see target_processes (only in 1D) |
    | AR_cuts | list | see target_processes |
    | SR_cuts | list | see target_processes; optional, not needed for the fraction calculation |
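The statement that the fractions of the specified processes sum to 1 can be illustrated with a short normalisation sketch. The yields below are made-up numbers for a single bin, and `process_fractions` is a hypothetical helper, not a function of the framework.

```python
def process_fractions(yields):
    """Normalise per-process event yields of one bin so that they sum to 1."""
    total = sum(yields.values())
    if total <= 0:
        raise ValueError("total yield must be positive to define fractions")
    return {process: value / total for process, value in yields.items()}


# Made-up application region yields for one bin of one category.
example_yields = {"QCD": 50.0, "Wjets": 30.0, "ttbar": 20.0}
fractions = process_fractions(example_yields)
print(fractions)
```

Only the processes listed in the config enter the sum, which is why the resulting fractions add up to 1 by construction.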

To run the FF calculation step, execute the python script and specify the config file (relative path possible):

python ff_calculation.py --config-file PATH/CONFIG.yaml
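Before starting the calculation, the rule that each split variable needs N(variable cuts)+1 bin edges can be checked with a few lines of Python. The function name and the example values are hypothetical; only the key names and the rule itself come from the parameter description above.

```python
def check_split_binedges(split_categories, split_categories_binedges):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    for variable, cuts in split_categories.items():
        edges = split_categories_binedges.get(variable)
        if edges is None:
            problems.append(f"{variable}: no bin edges defined")
        elif len(edges) != len(cuts) + 1:
            problems.append(
                f"{variable}: {len(cuts)} cuts need {len(cuts) + 1} bin edges, "
                f"got {len(edges)}"
            )
    return problems


# Hypothetical example values for one target process.
split_categories = {"njets": ["==0", "==1", ">=2"]}
good_edges = {"njets": [-0.5, 0.5, 1.5, 12.5]}
bad_edges = {"njets": [-0.5, 0.5, 1.5]}

print(check_split_binedges(split_categories, good_edges))
print(check_split_binedges(split_categories, bad_edges))
```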

Fake Factor corrections

In this step the corrections for the fake factors are calculated. This should be run after the FF calculation step.

Currently two different correction types are implemented:

  1. non closure correction depending on a specific variable
  2. DR to SR interpolation correction depending on a specific variable

All information for the FF correction calculation step is defined in a configuration file in the configs/ folder. Additional information is loaded automatically from the config that was used in the previous FF calculation step.
The FF correction config has the following parameters:

  • The expected input folder structure is workdir/WORKDIR_NAME/ERA/fake_factors/CHANNEL/*

    | parameter | type | description |
    |---|---|---|
    | workdir_name | string | name of the work directory for which the corrections should be calculated (normally the same as in the FF calculation step) |
    | era | string | data-taking era ("2018", "2017", "2016preVFP", "2016postVFP") |
    | channel | string | tau pair decay channel ("et", "mt", "tt") |
  • In target_processes the processes for which FF corrections should be calculated (normally for QCD, Wjets, ttbar) are defined.
    Each target process needs some specifications:

    | parameter | type | description |
    |---|---|---|
    | non_closure | dict | one or two non-closure corrections can be specified, each identified by the variable the correction is calculated for (e.g. leading_lep_pt); if more than one correction is specified, leading_lep_pt should come first (due to code specifics) because the second correction is calculated with the first already applied |
    | DR_SR | dict | this correction should be specified only once per process in target_processes |

    Each correction has following specifications:

    | parameter | type | description |
    |---|---|---|
    | var_dependence | string | variable the FF correction measurement should depend on (e.g. "pt_1" for "leading_lep_pt") |
    | var_bins | list | bin edges for the variable specified in var_dependence |
    | SRlike_cuts | dict | event selections for the signal-like region of the target process that should replace the selections used in the previous FF calculation step |
    | ARlike_cuts | dict | event selections for the application-like region of the target process that should replace the selections used in the previous FF calculation step |
    | AR_SR_cuts | dict | event selections for the switch from the determination region to the signal/application region; only relevant for DR_SR corrections |
    | non_closure | dict | only relevant for DR_SR corrections: since additional fake factors are calculated for this correction, non-closure corrections can be calculated and applied to these fake factors before the actual DR to SR correction is calculated |
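The ordering requirement for non closure corrections can be checked on a config that is loaded as a dictionary. The sketch below is an assumption-laden illustration: the bin edges and the name of the second correction are hypothetical, and the check relies on Python dictionaries keeping their insertion order.

```python
def non_closure_order_ok(non_closure):
    """True if leading_lep_pt comes first whenever several corrections exist."""
    names = list(non_closure)  # dicts preserve the order of the config
    if len(names) < 2 or "leading_lep_pt" not in names:
        return True
    return names[0] == "leading_lep_pt"


# Hypothetical corrections for one target process.
target_process = {
    "non_closure": {
        "leading_lep_pt": {"var_dependence": "pt_1", "var_bins": [25, 40, 60, 100]},
        "SECOND_VARIABLE": {"var_dependence": "SECOND_VARIABLE", "var_bins": [0, 1, 2]},
    },
}

print(non_closure_order_ok(target_process["non_closure"]))
```

Because the second correction is calculated with the first one already applied, swapping the two entries changes the result, so a check like this is cheap insurance.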

To run the FF correction step, execute the python script and specify the config file (relative path possible):

python ff_corrections.py --config-file PATH/CONFIG.yaml 

There are two optional parameters, --skip-DRtoSR-ffs and --only-main-corrections. The correction calculation is done in three steps.
The first step is to calculate additional fake factors which are needed for the final DR to SR correction. If this is already done, this step can be skipped using --skip-DRtoSR-ffs.
The second step is to calculate non closure corrections for these additional DR to SR fake factors. If both steps are already done they can be skipped by using --only-main-corrections.
The last step is to calculate all the specified corrections for the main fake factors.
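How the two flags select the steps can be summarised in a short sketch. It is not the logic of ff_corrections.py itself: only the flag names and the three steps are taken from the description above, and the step labels and the `steps_to_run` helper are hypothetical.

```python
import argparse


def steps_to_run(argv):
    """Return the correction steps that remain after applying the skip flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-DRtoSR-ffs", action="store_true")
    parser.add_argument("--only-main-corrections", action="store_true")
    args = parser.parse_args(argv)

    steps = []
    # Step 1 is skipped by either flag.
    if not (args.skip_DRtoSR_ffs or args.only_main_corrections):
        steps.append("DR to SR fake factors")
    # Step 2 is skipped only by --only-main-corrections.
    if not args.only_main_corrections:
        steps.append("non closure corrections for DR to SR fake factors")
    # Step 3 always runs.
    steps.append("main corrections")
    return steps


print(steps_to_run([]))
print(steps_to_run(["--skip-DRtoSR-ffs"]))
print(steps_to_run(["--only-main-corrections"]))
```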

Hints

  • check out configs/general_definitions.py; this file contains many relevant definitions for plotting (dictionaries for names) and for the correctionlib output information
  • check ntuple_path and output_path (preselection) and file_path and workdir_name (fake factors, corrections) in the used config files to avoid wrong inputs or outputs
