Source code of the project "Using Machine Learning for Particle Tracking at the Large Hadron Collider", part of the ENLACE 2023 Summer Camp at UCSD.
This project was made for the ENLACE 2023 Research Summer Camp at UCSD over a timeframe of 7 weeks, and its results were compiled into a poster (available in the repo) as part of the requirements for the university students' projects. Most of the code is developed with the PyTorch module.
In the realm of particle physics, the Large Hadron Collider (LHC) stands as a colossal accelerator in Geneva, Switzerland, with its intricate network of superconducting magnets propelling particles to immense energies for experimental collisions. Within the LHC, the Compact Muon Solenoid (CMS) experiment captures the paths of charged particles through a powerful magnetic field, aiming to distinguish accurate tracks amidst complex particle interactions. Addressing the challenge of efficient track identification, the Line Segment Tracking (LST) algorithm emerges as a solution, reconstructing particle trajectories piece by piece from linear segments. Notably, LST's modular approach allows for parallelization, a crucial attribute for tackling the intricate scenarios posed by the forthcoming High-Luminosity LHC (HL-LHC). While LST thrives in parallel processing, it faces limitations when handling increasingly complex scenarios sequentially, which highlights the value of Machine Learning (ML) techniques. This role of ML is exemplified in our architecture, which leverages Deep Neural Networks (DNNs) with varying hidden layer sizes to process Line Segments (LS), culminating in an output neuron that discerns the authenticity of the track. The convergence of the loss function during training, influenced by the hidden layer size and model hyperparameters, underscores the symbiotic relationship between advanced ML and the progressive analysis of particle tracks.
As reported in the poster, the training stage consisted of training two types of DNNs (a minimal PyTorch sketch of both configurations follows the list):
- Small DNN: an architecture of 2 hidden layers with 32 neurons each, trained with a learning rate of 0.002, a batch size of 1000 data entries, and 50 epochs.
- Big DNN: an architecture of 2 hidden layers with 200 neurons each, trained with a learning rate of 0.002, a batch size of 1000 data entries, and 100 epochs.
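As an illustration only, the following is a minimal PyTorch sketch of what these two architectures could look like; the number of input features (`n_features`) and the sigmoid output layer are assumptions for the example, not the exact code used for the poster.

```python
import torch
import torch.nn as nn

class LSClassifier(nn.Module):
    """DNN with two hidden layers and a single output neuron scoring Real vs. Fake LS."""
    def __init__(self, n_features, hidden_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),  # prediction score in [0, 1]
        )

    def forward(self, x):
        return self.net(x)

# Configurations from the poster (n_features=10 is a placeholder):
small_dnn = LSClassifier(n_features=10, hidden_size=32)   # trained for 50 epochs
big_dnn   = LSClassifier(n_features=10, hidden_size=200)  # trained for 100 epochs
```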
The focus of our results will be on those obtained with the aforementioned Big DNN. The loss curve plot for both the training and testing datasets indicates that the model is indeed learning patterns in the training dataset that generalize to the testing dataset (and, by extension, to similar unseen data), because both curves trend downwards. If the two curves were to diverge at any point during the epochs, we would say that the model is overfitting; simply put, the model learned so much from the training dataset that it became specialized in detecting patterns of the training dataset only. A sketch of how such loss curves can be tracked is shown below.
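The following training-loop sketch records the train/test loss per epoch. It assumes a binary cross-entropy loss, an Adam optimizer, and hypothetical `train_loader`/`test_loader` objects; these details are illustrative and not necessarily identical to the code used for the poster.

```python
import torch
import matplotlib.pyplot as plt

def train(model, train_loader, test_loader, epochs=100, lr=0.002):
    """Train the DNN and record train/test loss per epoch to diagnose overfitting."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCELoss()
    train_losses, test_losses = [], []

    for epoch in range(epochs):
        model.train()
        running = 0.0
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(features).squeeze(1), labels.float())
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_losses.append(running / len(train_loader))

        model.eval()
        with torch.no_grad():
            running = sum(
                criterion(model(f).squeeze(1), l.float()).item() for f, l in test_loader
            )
        test_losses.append(running / len(test_loader))

    # Both curves trending downwards together suggests the model generalizes;
    # a test curve that diverges upward would signal overfitting.
    plt.plot(train_losses, label="train")
    plt.plot(test_losses, label="test")
    plt.xlabel("epoch")
    plt.ylabel("BCE loss")
    plt.legend()
    plt.show()
    return train_losses, test_losses
```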
The prediction scores histogram shows at a glance that the vast majority of data entries in our testing dataset are labeled as Fake and that the model predicts them as such, hence the concentration of Fake LS near the origin. In a similar fashion, the LS labeled as Real are distributed to the far right, indicating that the model indeed predicts those Real LS as Real. Nevertheless, we can see a small overlap of Real LS on top of the Fake LS at the far left of the plot, which means the model mispredicted certain LS labeled as Real and classified them as Fake; this is expected for binary classification models. A sketch of how such a histogram could be produced follows.
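A minimal sketch, assuming NumPy arrays `scores` (model prediction scores on the testing dataset) and `labels` (1 = Real LS, 0 = Fake LS); the binning and log scale are illustrative choices, not taken from the poster.

```python
import matplotlib.pyplot as plt

def plot_prediction_scores(scores, labels):
    """Histogram of prediction scores, split by the Real/Fake truth label."""
    plt.hist(scores[labels == 0], bins=50, alpha=0.6, label="Fake LS", log=True)
    plt.hist(scores[labels == 1], bins=50, alpha=0.6, label="Real LS", log=True)
    plt.xlabel("prediction score")
    plt.ylabel("number of LS")
    plt.legend()
    plt.show()
```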
On a related matter to the histogram overlaps, a better way to understand the rate at which the model is expected to make these mispredictions is with the help of a ROC curve, which at a glance tells us whether the model is performing accurate estimations of the Real LS (True Positive Rate, TPR) in contrast to the Fake LS that are mispredicted as Real (False Positive Rate, FPR). In the context of comparing the Graph Neural Network (GNN) against the DNN, this curve tells us exactly how close one model's performance is to the other's. On the first ROC curve we can observe that the performance of the Small DNN is actually worse than that of the GNN, but in the case of the Big DNN versus the GNN we observe that the DNN does almost the same work with a simpler architecture, in contrast to the complex nature of the GNN. For practical reasons, such as training time and the time needed to develop the model pipeline, the Big DNN is better for our purpose of classifying LS.
Continuing with the same plot, we added two square markers to indicate the estimated coordinates where TPR > 0.95 and TPR > 0.99 are reached, which in turn tell us the score threshold the model must use to comply with these TPR requirements (a sketch of how these thresholds can be extracted from the ROC curve is shown below). The follow-up tables contain the counts of LS that the DNN and the GNN select given their respective thresholds satisfying the previous TPR boundaries.
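A minimal sketch of computing the ROC curve and extracting these working-point thresholds, assuming scikit-learn and hypothetical `labels`/`scores` arrays; the actual analysis code may differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def roc_and_thresholds(labels, scores, tpr_targets=(0.95, 0.99)):
    """Plot the ROC curve and pick the score thresholds reaching the target TPRs."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    plt.plot(fpr, tpr, label=f"DNN (AUC = {auc(fpr, tpr):.3f})")

    cuts = {}
    for target in tpr_targets:
        idx = np.argmax(tpr > target)      # first point where the TPR exceeds the target
        cuts[target] = thresholds[idx]     # score threshold at that working point
        plt.plot(fpr[idx], tpr[idx], "s")  # square marker, as in the poster plot

    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
    return cuts  # e.g. {0.95: threshold, 0.99: threshold}
```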
In both Table 1 and Table 2 (also included in the poster) we observe a distribution of data predicted as Real and Fake similar to the prediction scores histogram. The interesting detail is that, using the same testing dataset for inference with both the DNN and the GNN, the two models select a substantial amount of the same LS.
|             | DNN > X | GNN > Y | Both   |
|-------------|---------|---------|--------|
| Real & Fake | 134876  | 128475  | 106670 |
| Real        | 49628   | 49628   | 48966  |
| Fake        | 85248   | 78847   | 57704  |

Table 1. LS selected for a TPR > 0.95. Note: X = 0.0328, Y = 0.0385.
|             | DNN > X | GNN > Y | Both   |
|-------------|---------|---------|--------|
| Real & Fake | 302497  | 313556  | 245207 |
| Real        | 51716   | 51717   | 51446  |
| Fake        | 250781  | 261839  | 193761 |

Table 2. LS selected for a TPR > 0.99. Note: X = 0.0032, Y = 0.0044.
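As an illustration of how the "Both" column could be computed, the sketch below counts the LS passing each model's threshold and their overlap. The arrays `dnn_scores`, `gnn_scores`, and `labels` are hypothetical names for scores evaluated on the same testing dataset; `x_cut` and `y_cut` correspond to the X and Y thresholds in the tables.

```python
import numpy as np

def selection_counts(dnn_scores, gnn_scores, labels, x_cut, y_cut):
    """Count LS passing each model's threshold and the overlap ("Both" column)."""
    dnn_sel = dnn_scores > x_cut
    gnn_sel = gnn_scores > y_cut
    both = dnn_sel & gnn_sel

    rows = {
        "Real & Fake": np.ones_like(labels, dtype=bool),
        "Real": labels == 1,
        "Fake": labels == 0,
    }
    for name, mask in rows.items():
        print(f"{name:12s} DNN: {np.sum(dnn_sel & mask):7d}  "
              f"GNN: {np.sum(gnn_sel & mask):7d}  Both: {np.sum(both & mask):7d}")
```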
- Alejandro Dennis
- Abraham Flores
- Jonathan Guiang (mentor)
- Frank Wuerthwein (PI)