Method for model prediction and evaluation of DRAM Uncorrected Errors (UEs).
Supplemental code for the SC20 paper "Cost-Aware Prediction of Uncorrected DRAM Errors in the Field".
The project is structured as follows:
- The folder ue_predict is a module containing the necessary files and methods for training and evaluation.
- The scripts train_test.py and evaluation.py provide a way of executing the 'ue_predict' module by allowing the configuration of specific parameters, such as the length of the prediction window or the prediction frequency.
- The folder data contains the files needed for training and evaluation.
- The folder synthetic_ues contains a synthetic log file of UEs based on the production MareNostrum 3 error logs. We provide this log file in order to enable future studies to quantify the real-world impact of DRAM uncorrected errors and any proposed resiliency techniques.
The UEPREDICT code is released under the BSD-3 License.
Scripts can be executed directly from the shell as Python 3 files. First, execute train_test.py to compute the predictions using walk-forward validation. The scripts' arguments are described below.
python3 train_test.py --verbose
If executed with the verbose argument, it prints information at each training/testing step, such as the confusion matrix values for the train and test sets or the best hyperparameters.
After its execution, it will generate a file with the probabilities calculated by the model of each instance belonging to class 1 (i.e. having an Uncorrected Error in the next prediction window), alongside the correct class label. This file is stored as data/predictions.csv by default.
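The predictions file can be inspected with pandas. The snippet below is a minimal sketch using illustrative inline data in place of data/predictions.csv; the column names "prob" and "label" are assumptions, so check the header of the file actually generated by train_test.py.

```python
import pandas as pd

# Illustrative stand-in for pd.read_csv("data/predictions.csv").
# "prob" is the model probability of class 1 (a UE in the next
# prediction window); "label" is the ground-truth class.
preds = pd.DataFrame({
    "prob": [0.05, 0.80, 0.30, 0.95],
    "label": [0, 1, 0, 1],
})

# Turn probabilities into hard predictions with a 0.5 threshold and
# count true/false positives as a quick sanity check.
pred_label = (preds["prob"] >= 0.5).astype(int)
tp = int(((pred_label == 1) & (preds["label"] == 1)).sum())
fp = int(((pred_label == 1) & (preds["label"] == 0)).sum())
print(tp, fp)  # → 2 0
```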
Once the predictions file is generated, it can be evaluated by executing the evaluation.py script, which reports information such as the number of impact mitigations performed or the number of correctly predicted UEs.
python3 evaluation.py --verbose
Switch | Long switch | Description |
---|---|---|
-pw | --prediction-window | Specifies the prediction window. It must be given as a pandas time offset alias; see the pandas offset aliases documentation for more information. |
-pf | --prediction-frequency | Specifies the prediction frequency. It must be given as a pandas time offset alias; see the pandas offset aliases documentation for more information. |
-tf | --train-frequency | Specifies the training frequency. It must be given as a pandas time offset alias; see the pandas offset aliases documentation for more information. |
-i-fts | --input-features | Path to the input features file, a CSV containing the features data. |
-i-ues | --input-ues | Path to the input Uncorrected Errors file, a CSV containing the UEs data. |
-o | --output-predictions | Path of the output file, a CSV containing the predictions data after the train/test iterations. |
-clf | --classifier | Classifier to use for training and predicting. Choices: RF, GBDT, LR, GNB, SVM or NN. |
-ru | --random-undersampling | Ratio of minority-class samples to majority-class samples after resampling, expressed as Nm/NrM, where Nm is the number of minority-class samples and NrM is the number of majority-class samples after resampling. If the ratio is 0, no resampling is performed. |
 | --verbose | If specified, shows information during the execution, such as performance, training times, etc. at each train/test split. |
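The `-pw`, `-pf` and `-tf` values are pandas time offset aliases, e.g. `1d` for one day or `6h` for six hours (example values, not script defaults). A quick way to check a value before passing it to the scripts:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# to_offset parses a pandas time offset alias, as accepted by
# --prediction-window, --prediction-frequency and --train-frequency,
# and raises a ValueError for invalid strings, so it can be used to
# validate an argument up front.
pw = to_offset("1d")  # a one-day prediction window (example value)
pf = to_offset("6h")  # a six-hour prediction frequency (example value)

# How many prediction points fit in one prediction window.
print(pd.Timedelta(pw) // pd.Timedelta(pf))  # → 4
```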
Switch | Long switch | Description |
---|---|---|
-pw | --prediction-window | Specifies the prediction window. It must be given as a pandas time offset alias; see the pandas offset aliases documentation for more information. |
 | --mitigation-time | Specifies the time needed for performing an impact mitigation. It must be given as a pandas time offset alias; see the pandas offset aliases documentation for more information. |
-i-preds | --input-predictions | Path to the input predictions file, a CSV containing the predictions data generated by the train_test script. |
-i-ues | --input-ues | Path to the input Uncorrected Errors file, a CSV containing the UEs data. |
-o | --output-evaluations | Path of the output file, a CSV containing the evaluations data. |
 | --verbose | If specified, shows information about the evaluation of the predictions at each split. |
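For intuition, a prediction can only lead to an impact mitigation if it arrives at least the mitigation time ahead of the UE. The sketch below illustrates that timing relation with made-up timestamps and a hypothetical 2-minute mitigation time; it is not the evaluation script's actual logic.

```python
import pandas as pd

# Hypothetical values: the time needed to perform a mitigation
# (--mitigation-time), the moment a prediction is made, and the
# moment the UE actually occurs.
mitigation_time = pd.Timedelta("2min")
prediction_time = pd.Timestamp("2020-01-01 10:00:00")
ue_time = pd.Timestamp("2020-01-01 10:05:00")

# The prediction is useful only if its lead time covers the
# mitigation time.
lead_time = ue_time - prediction_time
print(lead_time >= mitigation_time)  # → True
```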
This work was supported by the Spanish Ministry of Science and Technology (project PID2019-107255GB), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European Union’s Horizon 2020 research and innovation programme and EuroEXA project (grant agreement No 754337). Paul Carpenter and Marc Casas hold the Ramon y Cajal fellowship under contracts RYC2018-025628-I and RYC2017-23269, respectively, of the Ministry of Economy and Competitiveness of Spain.