Predicting Failures of Autoscaling Distributed Applications

This replication package can be used to fully replicate the results of our paper Predicting Failures of Autoscaling Distributed Applications, accepted at FSE 2024.

Table of Contents

  1. Introduction
  2. Replication Package Structure
  3. Quick Start
  4. Experimental Procedure

Introduction

Our work introduces PREFACE, an approach that combines descriptive statistics with a generative neural network (an autoencoder) to reveal anomalous KPI values that are symptoms of impending system failures, and ranks the microservices that are likely responsible for the failure. PREFACE introduces a preprocessing step that exploits descriptive statistics to deal with time series of KPI sets whose size varies over time, as in autoscaling distributed applications.
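
An illustrative sketch of the kind of preprocessing this refers to, assuming replica-level KPI columns named like service.kpi.replica and using mean/std/max as the descriptive statistics (the notebooks may use a different naming scheme and a different set of statistics):

# Illustrative sketch: collapse replica-level KPI columns into fixed-size
# per-service descriptive statistics, so the feature vector keeps the same
# dimensionality while pods are added or removed by the autoscaler.
import pandas as pd

def collapse_replicas(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes columns are named 'service.kpi.replica'; the replica set varies over time."""
    out = {}
    prefixes = {col.rsplit(".", 1)[0] for col in df.columns}  # 'service.kpi'
    for prefix in sorted(prefixes):
        cols = df[[c for c in df.columns if c.rsplit(".", 1)[0] == prefix]]
        out[f"{prefix}.mean"] = cols.mean(axis=1)
        out[f"{prefix}.std"] = cols.std(axis=1)
        out[f"{prefix}.max"] = cols.max(axis=1)
    return pd.DataFrame(out, index=df.index)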

This replication package includes:

  1. A large dataset of KPIs collected from Alemira, a commercial Learning Management System developed by Constructor Tech and currently in use in several educational institutions, and TrainTicket, a microservice application widely used in research projects. Both are microservice-based applications deployed on Kubernetes that take full advantage of its autoscaling mechanisms.
  2. The results of the experiments with PREFACE, PREdicting Failures in AutosCaling distributEd Applications, the approach presented in our manuscript, which predicts and localizes failures in autoscaling distributed applications.
  3. The toolset to execute PREFACE and replicate the results obtained on the provided dataset.

In addition, we implemented a set of utilities, excluded here for the sake of simplicity, that automate the whole experimental process: alemira-traffic-generator, alemira-metrics-aggregator, alemira-metrics-collector, chaos-mesh-failure-injector, train-ticket-deployment, train-ticket-hpa, train-ticket-traffic-generator, gcloud-metrics-collector and gcloud-metrics-aggregator. These utilities can be useful for those who want to replicate the experiments from scratch.

Terminology

  • KPI: Key Performance Indicator, the value of a metric collected from the Alemira and TrainTicket systems at the microservice level.
  • Anomalous KPI: a KPI whose reconstruction error is above that KPI's threshold, calculated as three standard deviations of the KPI's values on the normal dataset (see the sketch after this list).
  • Deep Autoencoder: the component of PREFACE that identifies the anomalous KPIs by computing the reconstruction error for each KPI alongside the overall reconstruction error. The architecture (size and number of layers) and hyperparameters of the Deep Autoencoder were defined and fine-tuned during the model validation process.
  • Localizer: this component aggregates the scores of the anomalous KPIs that belong to the same microservice and ranks the microservices, signaling as failing the top-ranked ones at each timestamp for which PREFACE predicts an anomalous state.
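
A minimal sketch of the thresholding idea, assuming the threshold is derived from the per-KPI reconstruction error observed on the normal (failure-free) dataset (illustrative, not the notebook code):

# Illustrative 3-sigma thresholding on per-KPI reconstruction errors.
import numpy as np

def kpi_thresholds(autoencoder, normal_data: np.ndarray) -> np.ndarray:
    """Per-KPI anomaly thresholds estimated on the failure-free dataset."""
    errors = np.abs(normal_data - autoencoder.predict(normal_data))
    return errors.mean(axis=0) + 3 * errors.std(axis=0)

def anomalous_kpis(autoencoder, samples: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Boolean mask marking, per timestamp, the KPIs whose reconstruction error exceeds their threshold."""
    errors = np.abs(samples - autoencoder.predict(samples))
    return errors > thresholds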

Dataset Naming Conventions

The datasets collected during normal execution are named normal_1_14.csv for Alemira and normal-2weeks.csv for TrainTicket. They comprise the data collected over two weeks of normal execution without failures and are used to train and validate the Deep Autoencoder. The datasets include time series of KPIs collected from multiple monitoring tools (described in Section 4.1.3 of our paper).

The datasets collected during the execution with injected failures are named as linear-{failure-type}-{target service}-{unique identifier}.csv, e.g., linear-cpu-stress-ts-basic-service-020616.csv.

Replication Package Structure

This replication package contains two folders, one for each system in our case study, named Alemira and TrainTicket. Each folder is composed as follows:

  • dataset_tune.ipynb is the notebook responsible for tuning the datasets, including aligning the failure-injection datasets with the training dataset and removing constant columns and columns with empty values.

  • data_set_normalize.ipynb is the notebook that normalizes the datasets using the min-max normalization technique.

For TrainTicket, these last two scripts are unified in dataset_tune_normalize.ipynb.

  • predict.ipynb is responsible for training the Autoencoder model and generating the predictions according to its reconstruction error. This notebook also calculates the ranking of the services to allow the localization of the failure.

  • results.ipynb is used to generate the graphs and plots shown in the manuscript.

  • input: this folder contains the datasets collected for and needed to run the experiments. More specifically:

    • datasets contains the subfolder Consolidated, where all the datasets related to both the normal execution and the failure-injection executions can be found. This includes the dataset needed for the training of the model.
    • other contains failure-injection-log.csv, where the information about each failure injection is stored, including Failure Type, Failure Pattern, Target Service, Beginning of the Experiment, End of the Experiment, Name of the Related Dataset, and System Disruption Timestamp.
  • output contains the folder output-111 for Alemira and output-train_ticket for TrainTicket: this folder contains all the output files generated by the scripts. These files are saved in multiple subfolders of output-111 and output-train_ticket. More specifically:

    • datasets contains two subfolders, Tuned and Normalized. These contain the preprocessed datasets and the datasets normalized with the min-max normalization technique, respectively.
    • predictions contains a .csv file for each failure-injection dataset which stores, for each timestamp, a boolean value (1 or 0) indicating whether PREFACE predicted a failure or not.
    • anomalies_list contains a .csv file for each failure-injection dataset, where we stored the reconstruction error of each anomalous KPI for each timestamp. This is used for debugging purposes.
    • anomalies_lists_services_only similarly contains a .csv file for each failure-injection dataset, where we stored the z-score of the reconstruction error of each anomalous KPI related to the services, ranked from largest to smallest.
    • anomalies_lists_services_only_sliding_window, as before, contains a .csv file for each failure-injection dataset, where for each minute we stored the ranked z-score of the z-score of the reconstruction error of the anomalous KPIs, calculated using the 20-minute sliding-window method described in the manuscript.
    • localisations_re_sliding_window includes a .csv file for each failure-injection dataset, where we stored the ranking of the services using the z-score of the z-score of the reconstruction error, calculated as described in the manuscript.
    • models is the folder in which we store the trained Autoencoder.
    • other stores a .csv file detailing the timing of each failure injection, including the Failure Injection Experiment Name, the Total Number of Timestamps, the Timestamp at which the Failure Injection Started, and the Timestamp at which the Failure Injection Ended.
    • kpis_not_seen_in_prod

  All the files in the output folder are generated once the scripts are executed.

  • predict_notebook_sections is a folder that contains some Jupyter Notebooks with functions that are used by the four main scripts described before.

  • functions.ipynb is a notebook containing additional useful functions used by the scripts described before.

  • Configs is a folder that contains the configurations needed by the scripts to run.

Quick Start

Prerequisites

To run the experiments we used a machine with the following configuration. This is a tested setup, but the scripts presented in this replication package can also be run on other operating systems (Windows or Linux).

  • OS: macOS Catalina
  • Processor: 2.2 GHz 6-Core Intel Core i7
  • Memory: 16 GB 2400 MHz DDR4
  • Software packages

Step 1. Extract datasets from compressed tar archives

Alemira

# Extract the consolidated dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/input/datasets/Consolidated
tar -xzf normal_1_14.csv.zip

# Extract the normalized dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/output/output-111/datasets/Normalized
tar -xzf normal_1_14.csv.zip

# Extract the tuned dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/output/output-111/datasets/Tuned
tar -xzf normal_1_14.csv.zip

TrainTicket

# Extract the consolidated dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/input/datasets/Consolidated
cat normal-2weeks.tar.gz* | tar zx

# Extract the normalized dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/output/output-train_ticket/datasets/Normalized
cat normal-2weeks.tar.gz* | tar zx

# Extract the tuned dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/output/output-train_ticket/datasets/Tuned
cat normal-2weeks.tar.gz* | tar zx

Step 2. Setup environment

# Create conda environment
conda create --name preface-analysis --channel conda-forge python=3.10 jupyterlab=4.2.0 numpy=1.26.4 pandas=2.2.2 scikit-learn=1.4.2 matplotlib=3.8.4 plotly=5.22.0 scipy=1.13.0 tensorflow=2.15.0 statsmodels=0.14.1 networkx=3.3

# Activate conda environment
conda activate preface-analysis

# Open jupyter notebooks in the project folder
cd {PATH TO PREFACE}/PREFACE
jupyter lab

Step 3. Run jupyter notebooks one by one

Alemira

  1. dataset_tune.ipynb
  2. data_set_normalize.ipynb
  3. predict.ipynb
  4. results.ipynb

TrainTicket

  1. dataset_tune_normalize.ipynb
  2. predict.ipynb
  3. results.ipynb

The plots generated in results.ipynb should match Fig. 6 and Fig. 7 in our paper.

Experimental Procedure

Running the experiments involves multiple steps, namely data preprocessing, prediction generation, and results visualization. We detail each script as follows.

Data Preprocessing

Script: dataset_tune.ipynb

  • Purpose: data preprocessing
    • Input: input/datasets/Consolidated - raw data with each data set consolidated in a single .csv file
    • Output: output/.../datasets/Tuned - preprocessed data

Script: data_set_normalize.ipynb

  • Purpose: data normalization and smoothing by averaging the 3 most recent points (see the sketch after this list)
    • Input: output/.../datasets/Tuned - preprocessed data
    • Output: output/.../datasets/Normalized - normalized data
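
A minimal sketch of this normalization step, assuming per-column min-max scaling and a rolling mean over the 3 most recent points (illustrative, not the notebook code; taking the scaling ranges from the training data is an assumption here):

# Illustrative min-max normalization plus 3-point smoothing with pandas.
import pandas as pd

def normalize_and_smooth(train: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    """Scale each KPI column to [0, 1] using the ranges observed on the training data,
    then smooth by averaging the 3 most recent points."""
    lo, hi = train.min(), train.max()
    scaled = (df - lo) / (hi - lo).replace(0, 1)  # guard against constant columns
    return scaled.rolling(window=3, min_periods=1).mean()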

Predictions Generation

Script: predict.ipynb

  • Purpose: trains the model
    • Input: output/.../datasets/Normalized - normalized normal data
    • Output: output/.../models - trained autoencoder model
  • Purpose: makes and visualizes timestamp-level predictions
    • Input: output/.../datasets/Normalized - normalized data with injected failures
    • Output: output/.../predictions
  • Purpose: detects KPI-level anomalies
    • Input: output/.../datasets/Normalized - normalized data with injected failures
    • Output: output/.../anomalies_lists_services_only
    • Output: output/.../anomalies_lists_services_only_sliding_window
  • Purpose: localizes failures (see the sketch after this list)
    • Input: output/.../anomalies_lists_services_only_sliding_window
    • Output: output/.../localisations_re_sliding_window
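
A minimal sketch of the localization step, assuming the z-scores of the anomalous KPIs' reconstruction errors are summed per microservice and the services are ranked by that aggregate (illustrative; the notebooks additionally apply the 20-minute sliding window described in the manuscript):

# Illustrative ranking of services by their aggregated anomaly scores.
import pandas as pd

def rank_services(anomalies: pd.DataFrame) -> pd.Series:
    """anomalies: one row per anomalous KPI at a given timestamp, with columns
    'service' and 'zscore' (z-score of the KPI's reconstruction error).
    Returns the services ranked by aggregated anomaly score, highest first."""
    return anomalies.groupby("service")["zscore"].sum().sort_values(ascending=False)

# The top-ranked service is signaled as the most likely failing one:
# rank_services(anomalies_at_t).index[0]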

Results Visualization

Script: results.ipynb

  • Purpose: visualization of the results
    • Input: output/.../predictions
    • Input: output/.../localisations_re_sliding_window
    • Output: visualization
