Predicting Failures of Autoscaling Distributed Applications

This replication package can be used to fully replicate the results of our paper Predicting Failures of Autoscaling Distributed Applications, accepted at FSE 2024.

Table of Contents

  1. Introduction
  2. Replication Package Structure
  3. Quick Start
  4. Experimental Procedure

Introduction

Our work introduces PREFACE, an approach that combines descriptive statistics with a generative neural network (an autoencoder) to reveal anomalous KPI values that are symptoms of impending system failures, and ranks the microservices that are likely responsible for the failure. PREFACE introduces a preprocessing step that exploits descriptive statistics to deal with time series of KPI sets whose size varies over time, as in autoscaling distributed applications.
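
An illustrative sketch of the kind of preprocessing this refers to, assuming replica-level KPI columns named like service.kpi.replica and using mean/std/max as the descriptive statistics (the notebooks may use a different naming scheme and a different set of statistics):

# Illustrative sketch: collapse replica-level KPI columns into fixed-size
# per-service descriptive statistics, so the feature vector keeps the same
# dimensionality while pods are added or removed by the autoscaler.
import pandas as pd

def collapse_replicas(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes columns are named 'service.kpi.replica'; the replica set varies over time."""
    out = {}
    prefixes = {col.rsplit(".", 1)[0] for col in df.columns}  # 'service.kpi'
    for prefix in sorted(prefixes):
        cols = df[[c for c in df.columns if c.rsplit(".", 1)[0] == prefix]]
        out[f"{prefix}.mean"] = cols.mean(axis=1)
        out[f"{prefix}.std"] = cols.std(axis=1)
        out[f"{prefix}.max"] = cols.max(axis=1)
    return pd.DataFrame(out, index=df.index)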

This replication package includes:

  1. A large dataset of KPIs collected from Alemira, a commercial Learning Management System developed by Constructor Tech and currently in use in several educational institutions, and TrainTicket, a microservice application widely used in research projects. Both are microservice-based applications deployed on Kubernetes that take full advantage of its autoscaling mechanisms.
  2. The results of the experiments with PREFACE, PREdicting Failures in AutosCaling distributEd Applications, the approach presented in our manuscript, which predicts and localizes failures in autoscaling distributed applications.
  3. The toolset to execute PREFACE and replicate the results obtained on the provided dataset.

In addition, we implemented a set of utilities, excluded here for the sake of simplicity, that automate the whole experimental process: alemira-traffic-generator, alemira-metrics-aggregator, alemira-metrics-collector, chaos-mesh-failure-injector, train-ticket-deployment, train-ticket-hpa, train-ticket-traffic-generator, gcloud-metrics-collector and gcloud-metrics-aggregator. These utilities can be useful for those who want to replicate the experiments from scratch.

Terminology

  • KPI: Key Performance Indicator, the value of a metric collected from the Alemira and TrainTicket systems at the microservice level.
  • Anomalous KPI: a KPI whose reconstruction error is above that KPI's threshold, calculated as three standard deviations of the KPI's values on the normal dataset (see the sketch after this list).
  • Deep Autoencoder: the component of PREFACE that identifies the anomalous KPIs by computing the reconstruction error for each KPI alongside the overall reconstruction error. The architecture (size and number of layers) and hyperparameters of the Deep Autoencoder were defined and fine-tuned during the model validation process.
  • Localizer: this component aggregates the scores of the anomalous KPIs that belong to the same microservice and ranks the microservices, signaling as failing the top-ranked ones at each timestamp for which PREFACE predicts an anomalous state.
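
A minimal sketch of the thresholding idea, assuming the threshold is derived from the per-KPI reconstruction error observed on the normal (failure-free) dataset (illustrative, not the notebook code):

# Illustrative 3-sigma thresholding on per-KPI reconstruction errors.
import numpy as np

def kpi_thresholds(autoencoder, normal_data: np.ndarray) -> np.ndarray:
    """Per-KPI anomaly thresholds estimated on the failure-free dataset."""
    errors = np.abs(normal_data - autoencoder.predict(normal_data))
    return errors.mean(axis=0) + 3 * errors.std(axis=0)

def anomalous_kpis(autoencoder, samples: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Boolean mask marking, per timestamp, the KPIs whose reconstruction error exceeds their threshold."""
    errors = np.abs(samples - autoencoder.predict(samples))
    return errors > thresholds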

Dataset Naming Conventions

The datasets collected during normal execution are named normal_1_14.csv for Alemira and normal-2weeks.csv for TrainTicket. They comprise the data collected over two weeks of normal execution without failures and are used to train and validate the Deep Autoencoder. The datasets include time series of KPIs collected from multiple monitoring tools (described in Section 4.1.3 of our paper).

The datasets collected during the execution with injected failures are named as linear-{failure-type}-{target service}-{unique identifier}.csv, e.g., linear-cpu-stress-ts-basic-service-020616.csv.

Replication Package Structure

This replication package contains two folders, one for each system in our case study, named Alemira and TrainTicket. Each folder is composed as follows:

  • dataset_tune.ipynb is the notebook responsible for tuning the datasets, including aligning the failure-injection datasets with the training dataset and removing constant columns and columns with empty values.

  • data_set_normalize.ipynb is the notebook that normalizes the datasets using the min-max normalization technique.

For TrainTicket, these last two scripts are unified in dataset_tune_normalize.ipynb.

  • predict.ipynb is responsible for training the Autoencoder model and generating the predictions according to its reconstruction error. This notebook also calculates the ranking of the services to allow the localization of the failure.

  • results.ipynb is used to generate the graphs and plots shown in the manuscript.

  • input: this folder contains the datasets collected for and needed to run the experiments. More specifically:

    • datasets contains the subfolder Consolidated, where all the datasets related to both the normal execution and the failure-injection executions can be found. This includes the dataset needed for the training of the model.
    • other contains failure-injection-log.csv, where the information about each failure injection is stored, including Failure Type, Failure Pattern, Target Service, Beginning of the Experiment, End of the Experiment, Name of the Related Dataset, and System Disruption Timestamp.
  • output contains the folder output-111 for Alemira and output-train_ticket for TrainTicket: this folder contains all the output files generated by the scripts. These files are saved in multiple subfolders of output-111 and output-train_ticket. More specifically:

    • datasets contains two subfolders, Tuned and Normalized. These contain the preprocessed datasets and the datasets normalized with the min-max normalization technique, respectively.
    • predictions contains a .csv file for each failure-injection dataset which stores, for each timestamp, a boolean value (1 or 0) indicating whether PREFACE predicted a failure or not.
    • anomalies_list contains a .csv file for each failure-injection dataset, where we stored the reconstruction error of each anomalous KPI for each timestamp. This is used for debugging purposes.
    • anomalies_lists_services_only similarly contains a .csv file for each failure-injection dataset, where we stored the z-score of the reconstruction error of each anomalous KPI related to the services, ranked from largest to smallest.
    • anomalies_lists_services_only_sliding_window, as before, contains a .csv file for each failure-injection dataset, where for each minute we stored the ranked z-score of the z-score of the reconstruction error of the anomalous KPIs, calculated using the 20-minute sliding-window method described in the manuscript.
    • localisations_re_sliding_window includes a .csv file for each failure-injection dataset, where we stored the ranking of the services using the z-score of the z-score of the reconstruction error, calculated as described in the manuscript.
    • models is the folder in which we store the trained Autoencoder.
    • other stores a .csv file detailing the timing of each failure injection, including the Failure Injection Experiment Name, the Total Number of Timestamps, the Timestamp at which the Failure Injection Started, and the Timestamp at which the Failure Injection Ended.
    • kpis_not_seen_in_prod

  All the files in the output folder are generated once the scripts are executed.

  • predict_notebook_sections is a folder that contains some Jupyter Notebooks with functions that are used by the four main scripts described before.

  • functions.ipynb is a notebook containing additional useful functions used by the scripts described before.

  • Configs is a folder that contains the configurations needed by the scripts to run.

Quick Start

Prerequisites

To run the experiments we used a machine with the following configuration. This is a tested setup, but the scripts presented in this replication package can also be run on other operating systems (Windows or Linux).

  • OS: macOS Catalina
  • Processor: 2.2 GHz 6-Core Intel Core i7
  • Memory: 16 GB 2400 MHz DDR4
  • Software packages

Step 1. Extract datasets from compressed tar archives

Alemira

# Extract the consolidated dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/input/datasets/Consolidated
tar -xzf normal_1_14.csv.zip

# Extract the normalized dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/output/output-111/datasets/Normalized
tar -xzf normal_1_14.csv.zip

# Extract the tuned dataset for 2-week normal execution of Alemira
cd {PATH TO PREFACE}/PREFACE
cd Alemira/output/output-111/datasets/Tuned
tar -xzf normal_1_14.csv.zip

TrainTicket

# Extract the consolidated dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/input/datasets/Consolidated
cat normal-2weeks.tar.gz* | tar zx

# Extract the normalized dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/output/output-train_ticket/datasets/Normalized
cat normal-2weeks.tar.gz* | tar zx

# Extract the tuned dataset for 2-week normal execution of TrainTicket
cd {PATH TO PREFACE}/PREFACE
cd TrainTicket/output/output-train_ticket/datasets/Tuned
cat normal-2weeks.tar.gz* | tar zx

Step 2. Setup environment

# Create conda environment
conda create --name preface-analysis --channel conda-forge python=3.10 jupyterlab=4.2.0 numpy=1.26.4 pandas=2.2.2 scikit-learn=1.4.2 matplotlib=3.8.4 plotly=5.22.0 scipy=1.13.0 tensorflow=2.15.0 statsmodels=0.14.1 networkx=3.3

# Activate conda environment
conda activate preface-analysis

# Open jupyter notebooks in the project folder
cd {PATH TO PREFACE}/PREFACE
jupyter lab

Step 3. Run jupyter notebooks one by one

Alemira

  1. dataset_tune.ipynb
  2. data_set_normalize.ipynb
  3. predict.ipynb
  4. results.ipynb

TrainTicket

  1. dataset_tune_normalize.ipynb
  2. predict.ipynb
  3. results.ipynb

The plots generated in results.ipynb should match Fig. 6 and Fig. 7 in our paper.

Experimental Procedure

Running the experiments involves multiple steps, namely data preprocessing, prediction generation, and results visualization. We detail each script as follows.

Data Preprocessing

Script: dataset_tune.ipynb

  • Purpose: data preprocessing
    • Input: input/datasets/Consolidated - raw data with each data set consolidated in a single .csv file
    • Output: output/.../datasets/Tuned - preprocessed data

Script: data_set_normalize.ipynb

  • Purpose: data normalization and smoothing by averaging the 3 most recent points (see the sketch after this list)
    • Input: output/.../datasets/Tuned - preprocessed data
    • Output: output/.../datasets/Normalized - normalized data
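
A minimal sketch of this normalization step, assuming per-column min-max scaling and a rolling mean over the 3 most recent points (illustrative, not the notebook code; taking the scaling ranges from the training data is an assumption here):

# Illustrative min-max normalization plus 3-point smoothing with pandas.
import pandas as pd

def normalize_and_smooth(train: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    """Scale each KPI column to [0, 1] using the ranges observed on the training data,
    then smooth by averaging the 3 most recent points."""
    lo, hi = train.min(), train.max()
    scaled = (df - lo) / (hi - lo).replace(0, 1)  # guard against constant columns
    return scaled.rolling(window=3, min_periods=1).mean()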

Predictions Generation

Script: predict.ipynb

  • Purpose: trains the model
    • Input: output/.../datasets/Normalized - normalized normal data
    • Output: output/.../models - trained autoencoder model
  • Purpose: makes and visualizes timestamp-level predictions
    • Input: output/.../datasets/Normalized - normalized data with injected failures
    • Output: output/.../predictions
  • Purpose: detects KPI-level anomalies
    • Input: output/.../datasets/Normalized - normalized data with injected failures
    • Output: output/.../anomalies_lists_services_only
    • Output: output/.../anomalies_lists_services_only_sliding_window
  • Purpose: localizes failures (see the sketch after this list)
    • Input: output/.../anomalies_lists_services_only_sliding_window
    • Output: output/.../localisations_re_sliding_window
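
A minimal sketch of the localization step, assuming the z-scores of the anomalous KPIs' reconstruction errors are summed per microservice and the services are ranked by that aggregate (illustrative; the notebooks additionally apply the 20-minute sliding window described in the manuscript):

# Illustrative ranking of services by their aggregated anomaly scores.
import pandas as pd

def rank_services(anomalies: pd.DataFrame) -> pd.Series:
    """anomalies: one row per anomalous KPI at a given timestamp, with columns
    'service' and 'zscore' (z-score of the KPI's reconstruction error).
    Returns the services ranked by aggregated anomaly score, highest first."""
    return anomalies.groupby("service")["zscore"].sum().sort_values(ascending=False)

# The top-ranked service is signaled as the most likely failing one:
# rank_services(anomalies_at_t).index[0]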

Results Visualization

Script: results.ipynb

  • Purpose: visualization of the results
    • Input: output/.../predictions
    • Input: output/.../localisations_re_sliding_window
    • Output: visualization
