Example

Fabio Catalano edited this page Dec 18, 2018 · 21 revisions

Perform a heavy-ion optimisation!

In this tutorial you will learn how to optimize the selection of a rare particle produced in proton-proton and heavy-ion collisions at the Large Hadron Collider at CERN. The goal of the exercise is to measure the production rate of this particle, which carries fundamental information about the quark-gluon plasma.

The challenge: separate signal from background!

Λc baryons are short-lived particles that decay, a tiny fraction of a second after their production, very close to the main interaction point of the collision. Experimentally, they can be identified by looking at their decay products. The main challenge for us is being able to discriminate between:

  • real (or signal) candidates
  • fake (background) candidates, which do not come from a real decay but from the random association of uncorrelated particles

As shown in the Figure above, the signal component is visible as a peak in a histogram of invariant mass. We want to enhance this peak and, in particular, increase its signal-to-background ratio.
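As a toy illustration of what the signal-to-background ratio means, one can simulate a Gaussian peak on top of a flat combinatorial background and count candidates in a window around the peak. All numbers below are made up for illustration; they are not the real Λc mass spectrum or yields:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy invariant-mass sample: a Gaussian signal peak on a flat background
# (illustrative values only; peak placed near the Λc mass, ~2.286 GeV/c^2).
m_sig = rng.normal(loc=2.286, scale=0.008, size=1000)   # signal candidates
m_bkg = rng.uniform(low=2.15, high=2.45, size=10000)    # combinatorial background
masses = np.concatenate([m_sig, m_bkg])

# Count candidates in a +-3 sigma window around the peak.
lo, hi = 2.286 - 3 * 0.008, 2.286 + 3 * 0.008
n_in_window = ((masses > lo) & (masses < hi)).sum()

# The expected background under the peak scales with the window width,
# so the signal yield can be estimated by subtraction.
bkg_in_window = 10000 * (hi - lo) / (2.45 - 2.15)
signal_estimate = n_in_window - bkg_in_window
print(f"S/B in the peak window ~ {signal_estimate / bkg_in_window:.2f}")
```

Tightening the selection removes background faster than signal, which is exactly what raising this ratio means in practice.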

Selection variables (or features) can be used to discriminate between signal and background candidates. As shown in the plot below, the distributions of the selection variables can be significantly different for signal and background candidates.

We want to define an optimal selection, which makes use of all the useful selection variables to increase the purity of our signal sample. The complete list of selection variables for this analysis (and for all the others) can be found in our database file Λc variables.

How can we isolate signal and background for the training?

Since we extract the amount of signal on a statistical basis (via a fit), we cannot know, in the data, whether a given candidate is signal or background. We can, however, use a trick to prepare our machine-learning sample: we take signal candidates from dedicated simulations, where signal can be identified unambiguously, and background candidates from real data after excluding the region where real signal can be present (green regions in the figure).
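The side-band trick can be sketched with a simple mask on the invariant-mass column. This is a minimal sketch with a toy data frame; the column name `inv_mass` and the window width are hypothetical, not the branch names used by the actual script:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy candidate table with a hypothetical invariant-mass column.
df_data = pd.DataFrame({"inv_mass": rng.uniform(2.15, 2.45, size=5000)})

# Exclude the region around the nominal Λc mass (~2.286 GeV/c^2) where real
# signal could sit; keep only the side bands as a pure background sample.
mass_lc, half_window = 2.286, 0.030
in_sidebands = (df_data["inv_mass"] < mass_lc - half_window) | \
               (df_data["inv_mass"] > mass_lc + half_window)
df_background = df_data[in_sidebands]
```

The candidates that survive the mask cannot contain real decays (up to tails of the peak), so they can safely be labeled as background for the training.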

Get some input data!

Download from lxplus the two files that contain data and Monte Carlo Λc candidates from proton-proton collisions collected with the ALICE detector at CERN. From the main folder of the repository, execute the following lines, replacing <my_cern_user> with your NICE username:

cd machine_learning_hep/data
mkdir inputroot
scp <my_cern_user>@lxplus.cern.ch:/afs/cern.ch/work/g/ginnocen/public/exampleInputML/*.root ./inputroot/

If you don't have an lxplus account, you can find the same files in this dropbox folder: inputroot

Run the optimisation

The file doclassification_regression.py in the folder machine_learning_hep is the main script you will use to perform the analysis. This macro provides several functionalities.

Choose your classification problem

You can select the type of optimisation problem you want to perform. In our case we will keep the default values, which are the ones needed for the Λc study.

mltype = "BinaryClassification"
mlsubtype = "HFmeson"
case = "Lc"

Choose the transverse momentum region

You can select the transverse momentum region you want to consider in the optimisation. In our case we will focus on the range from 2 to 4 GeV/c, as in the default settings.

var_skimming = ["pt_cand_ML"]
varmin = [2]
varmax = [4]
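This skimming amounts to a simple range cut on the `pt_cand_ML` column. A minimal sketch on a toy table (the values are invented for illustration):

```python
import pandas as pd

# Toy candidate table using the pt_cand_ML column named in the config.
df = pd.DataFrame({"pt_cand_ML": [0.8, 2.5, 3.1, 3.9, 6.0],
                   "inv_mass":   [2.20, 2.29, 2.31, 2.27, 2.35]})

# Keep only candidates in the chosen transverse-momentum interval.
varmin, varmax = 2, 4
skimmed = df[(df["pt_cand_ML"] >= varmin) & (df["pt_cand_ML"] < varmax)]
```

Candidate properties (and hence the optimal selection) depend strongly on transverse momentum, which is why the optimisation is done in pT intervals.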

Choose the number of signal and background candidates you want to use for the optimisation:

As will be described later, we need to define a training sample of pure signal and pure background candidates. The more candidates we use, the more accurate (up to a certain point!) the optimisation will be. We suggest starting with the default settings and increasing them according to the computing power of your machine.

nevt_sig = 1000
nevt_bkg = 1000
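Conceptually, these two numbers fix how many candidates are drawn from each source and combined into one labeled training sample. A minimal sketch with toy tables (column names are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy signal (from Monte Carlo) and background (from side bands) tables.
df_sig = pd.DataFrame({"some_feature": rng.normal(1.0, 0.5, 5000)})
df_bkg = pd.DataFrame({"some_feature": rng.normal(0.0, 0.5, 5000)})

# Draw the requested number of candidates from each class and label them.
nevt_sig = nevt_bkg = 1000
sample = pd.concat([
    df_sig.sample(n=nevt_sig, random_state=1).assign(signal=1),
    df_bkg.sample(n=nevt_bkg, random_state=1).assign(signal=0),
], ignore_index=True)
```

Keeping the two classes balanced, as the defaults do, avoids biasing the classifier toward the more abundant class.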

Prepare your training and testing sample:

By setting the parameter

loadsampleoption = 1

you prepare the ML sample. In our analysis case, signal candidates will be taken from Monte Carlo simulations and 1000 background candidates will be taken from data in a region of mass where no signal is present (the so-called side-band regions).

Choose your machine-learning algorithms:

By activating one (or more) of these bits you will enable different types of algorithms.

activate_scikit = 0
activate_xgboost = 1
activate_keras = 0

For a first look, we suggest using the XGBoost algorithm, which is the fastest.

Do the training and the testing:

By activating these two bits you will tell the script to run the training and the testing of your algorithms. In the training step, the trained models will be saved locally. In the testing step, a dataframe and a new TTree containing, for each candidate, the probabilities obtained from each algorithm will be saved.

dotraining = 1
dotesting = 1
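The train-then-score flow can be sketched in a few lines. This is a toy example on invented features; it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so the sketch is self-contained, and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Toy features: signal (label 1) and background (label 0) candidates,
# drawn from well-separated Gaussians purely for illustration.
X = np.vstack([rng.normal(1.0, 1.0, (1000, 3)),
               rng.normal(-1.0, 1.0, (1000, 3))])
y = np.array([1] * 1000 + [0] * 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

# Training step: fit the model (stand-in for the XGBoost training).
model = GradientBoostingClassifier(n_estimators=50, random_state=2)
model.fit(X_train, y_train)

# Testing step: store the per-candidate signal probability, as the script
# does when it writes out the dataframe / TTree.
df_test = pd.DataFrame(X_test, columns=["f0", "f1", "f2"])
df_test["prob_signal"] = model.predict_proba(X_test)[:, 1]
```

The saved probability column is what a later analysis step cuts on to define the optimised selection.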

Validation tools:

A long list of validation utilities, which includes score cross validation, ROC curves, learning curves, and feature importance, can be activated using the following bits:

docrossvalidation = 1
dolearningcurve = 1
doROC = 1
doboundary = 1
doimportance = 1
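As an example of what one of these checks computes, a ROC curve scans the classifier threshold and compares the true-positive rate against the false-positive rate; the area under it (AUC) summarises the separation power. A minimal sketch on invented scores (the shapes of the score distributions are arbitrary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(3)
# Toy per-candidate scores: signal scores pile up near 1, background near 0.
y_true = np.array([1] * 500 + [0] * 500)
scores = np.concatenate([rng.beta(5, 2, 500),   # signal-like scores
                         rng.beta(2, 5, 500)])  # background-like scores

# Scan the threshold and compute the area under the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(f"ROC AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so it is a convenient single number for comparing the activated algorithms.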