Log Representation - Supplemental Materials

This repository contains the detailed results and the replication package for the paper "On the Effectiveness of Log Representation for Log-based Anomaly Detection".

Introduction

The overall framework of our experiments and our research questions:

We organize this repository into the following folders:

'models' contains the anomaly detection models we studied, both traditional and deep-learning.
- Traditional models (i.e., SVM, decision tree, logistic regression, random forest)
- Deep-learning models (i.e., MLP, CNN, LSTM)
'logrep' contains the code we used to generate all the studied log representations.
- Feature generation
- Feature aggregation (from event-level to sequence-level)
'results' contains the experimental results that are not listed in the paper due to space limits.
- Results for RQ1, RQ2, RQ3

Dependencies

We recommend using an Anaconda environment with Python 3.9. The following Python packages are required:

Numpy 1.20.3
PyTorch 1.10.1
Sklearn 0.24.2
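
To check an existing environment against the versions listed above, a minimal sketch along the following lines may help. The distribution names (numpy, torch, scikit-learn) are the usual PyPI names and are an assumption here, and check_environment is a hypothetical helper that is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions from the list above; the keys are assumed PyPI distribution names.
PINNED = {"numpy": "1.20.3", "torch": "1.10.1", "scikit-learn": "0.24.2"}


def check_environment(pinned=PINNED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu113" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for package, (installed, ok) in check_environment().items():
    print(f"{package}: installed={installed}, matches pinned version={ok}")
```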

Dataset

Source

We use the HDFS, BGL, Spirit, and Thunderbird datasets. The original datasets are available from the LogHub project. (We do not provide the generated log representations because they are very large. Please generate them with the code provided.)

Due to computational limitations, we used subsets of the Spirit and Thunderbird datasets in our experiments. These subsets are available on Zenodo.

Extra regular expressions passed to the Drain parser

We used Drain to parse the studied datasets. We adopted the default parameters from the following paper for parsing.

Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree, Proceedings of the 24th International Conference on Web Services (ICWS), 2017.

However, with the default settings the Drain parser generated too many templates because it failed to spot some dynamic fields. We passed the following regular expressions to reduce the number of templates.

For the BGL dataset:

Configuration used in our experiments:

regex      = [r'core\.\d+',
              r'(?<=r)\d{1,2}',
              r'(?<=fpr)\d{1,2}',
              r'(0x)?[0-9a-fA-F]{8}',
              r'(?<=\.\.)0[xX][0-9a-fA-F]+',
              r'(?<=\.\.)\d+(?!x)',
              r'\d+(?=:)',
              r'^\d+$',  #only numbers
              r'(?<=\=)\d+(?!x)',
              r'(?<=\=)0[xX][0-9a-fA-F]+',
              r'(?<=\ )[A-Z][\+|\-](?= |$)',
              r'(?<=:\ )[A-Z](?= |$)',
              r'(?<=\ [A-Z]\ )[A-Z](?= |$)'
              ]

We refined the RegExps for more accurate parsing as follows:

regex      = [r'core\.\d+',
              r'(?<=:)(\ [A-Z][+-]?)+(?![a-z])', # match X+ A C Y+......
              r'(?<=r)\d{1,2}',
              r'(?<=fpr)\d{1,2}',
              r'(0x)?[0-9a-fA-F]{8}',
              r'(?<=\.\.)0[xX][0-9a-fA-F]+',
              r'(?<=\.\.)\d+(?!x)',
              r'\d+(?=:)',
              r'^\d+$',  #only numbers
              r'(?<=\=)\d+(?!x)',
              r'(?<=\=)0[xX][0-9a-fA-F]+']  # for hexadecimal

For the Spirit dataset:

regex      = [r'^\d+$',  #only numbers
              r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^0-9]',   # IP address
              r'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$',   # MAC address
              r'\d{14}(.)[0-9A-Z]{10,}',   # message id
              r'(?<=@#)(?<=#)\d+',   #  message id special format
              r'[0-9A-Z]{10,}', # id
              r'(?<=:|=)(\d|\w+)(?=>|,| |$|\\)'   # parameter after:|=
             ]

For the Thunderbird dataset:

regex      = [
             r'(\d+\.){3}\d+',
             r'((a|b|c|d)n(\d){2,}\ ?)+', # a|b|c|dn+number
             r'\d{14}(.)[0-9A-Z]{10,}@tbird-#\d+#', # message id
             r'(?![0-9]+\W)(?![a-zA-Z]+\W)(?<!_|\w)[0-9A-Za-z]{8,}(?!_)',      # char+numbers,
             r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)', # ip address
             r'\d{8,}',   # numbers + 8
             r'(?<=:)(\d+)(?= )',    # parameter after :
             r'(?<=pid=)(\d+)(?= )',   # pid=XXXXX
             r'(?<=Lustre: )(\d+)(?=:)', # Lustre:
             r'(?<=,)(\d+)(?=\))'
             ]
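
To illustrate how such expressions reduce the number of templates, the sketch below masks dynamic fields before template extraction. It assumes that each expression is applied in list order with re.sub and that every match is replaced by the wildcard <*>, which is our understanding of how Drain implementations preprocess a log message. The helper mask_dynamic_fields and the sample messages are made up for illustration and are not taken from the datasets.

```python
import re


def mask_dynamic_fields(line, patterns, wildcard="<*>"):
    """Replace every match of every pattern, in list order, with the wildcard."""
    for pattern in patterns:
        line = re.sub(pattern, wildcard, line)
    return line


# A few expressions copied from the BGL and Thunderbird lists above.
bgl_patterns = [
    r'core\.\d+',
    r'(?<=\=)\d+(?!x)',
    r'(?<=\=)0[xX][0-9a-fA-F]+',
]
thunderbird_patterns = [
    r'(\d+\.){3}\d+',
    r'(?<=pid=)(\d+)(?= )',
]

# Hypothetical log messages, for illustration only.
print(mask_dynamic_fields("generating core.2275", bgl_patterns))
# generating <*>
print(mask_dynamic_fields("iar=0x0000 code=12", bgl_patterns))
# iar=<*> code=<*>
print(mask_dynamic_fields("connect from 10.0.0.1 pid=1234 done", thunderbird_patterns))
# connect from <*> pid=<*> done
```

Once the varying values are masked, messages that differ only in those values collapse into the same string, so the parser no longer creates a separate template for each of them.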

Experiments

The general process to replicate our results is:

1. Parse the raw logs into a structured dataset using loglizer with the Drain parser, and save it in JSON format.
2. Split the dataset into training and testing sets and save them in NPZ format with the keys x_train, y_train, x_test, and y_test.
3. Generate the selected log representations with the corresponding code in the logrep folder and save them in NPY or NPZ format.
4. If the studied technique generates event-level representations, use aggregation.py in the logrep folder to merge them into sequence-level representations for the models that require sequence-level input.
5. Load the generated representations and the corresponding labels, and run the models in the models folder to get the results.
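
The splitting step can be sketched as follows. This is a minimal example and not the script used in the paper: the helper split_and_save, the 80/20 ratio, the order-preserving split, and the toy data are all assumptions made for illustration. Only the four NPZ keys come from the description above.

```python
import os
import tempfile

import numpy as np


def split_and_save(x, y, path, train_ratio=0.8):
    """Split (x, y) in their given order and save the four arrays to an NPZ file."""
    x, y = np.asarray(x), np.asarray(y)
    n_train = int(len(x) * train_ratio)
    np.savez(
        path,
        x_train=x[:n_train], y_train=y[:n_train],
        x_test=x[n_train:], y_test=y[n_train:],
    )
    return n_train


# Toy data: 10 sequences of 5 event IDs each, with binary anomaly labels.
x = np.arange(50).reshape(10, 5)
y = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split.npz")
    split_and_save(x, y, path)
    # Load the file back and record the shape stored under each key.
    with np.load(path) as data:
        shapes = {key: data[key].shape for key in data.files}

print(shapes)
```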

Sample parsed data and split data are provided in the samples folder.
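
For the aggregation step, the idea of merging event-level vectors into one sequence-level vector can be sketched as below. Mean and sum pooling are shown only as common choices. We do not claim that aggregation.py uses either of them, and aggregate_sequence is a hypothetical helper.

```python
import numpy as np


def aggregate_sequence(event_vectors, method="mean"):
    """Merge a [num_events, embedding_size] array into one [embedding_size] vector."""
    event_vectors = np.asarray(event_vectors, dtype=float)
    if method == "mean":
        return event_vectors.mean(axis=0)
    if method == "sum":
        return event_vectors.sum(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")


# A toy sequence of two events, each represented by a 2-dimensional vector.
sequence = [[1.0, 2.0], [3.0, 4.0]]
print(aggregate_sequence(sequence, "mean"))  # [2. 3.]
print(aggregate_sequence(sequence, "sum"))   # [4. 6.]
```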

Network details for CNN and LSTM

CNN

Layer	Parameters	Output
Input	win_size * Embedding_size	N/A
FC	Embedding_size * 50	win_size * 50
Conv 1	kernel_size=[3, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size − 3, 1], LeakyReLU	50 * 1 * 1
Conv 2	kernel_size=[4, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size − 3, 1], LeakyReLU	50 * 1 * 1
Conv 3	kernel_size=[5, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size − 4, 1], LeakyReLU	50 * 1 * 1
Concat	Concatenate feature maps of Conv 1, Conv 2, Conv 3, Dropout(0.5)	150 * 1 * 1
FC	[150 * 2]	2
Output	Softmax
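
The output shapes in the CNN table can be checked with a little arithmetic. The sketch below assumes that each convolution has 50 output channels (read from the Output column), that "valid" padding means no padding, and that max pooling uses a stride equal to its window, as PyTorch does by default. The helper names and win_size = 20 are illustrative only.

```python
def conv_out(length, kernel, stride=1):
    """Output length of a convolution without padding along one axis."""
    return (length - kernel) // stride + 1


def pool_out(length, window):
    """Output length of max pooling whose stride equals its window."""
    return (length - window) // window + 1


def branch_shape(win_size, kernel_h, pool_h, channels=50, width=50):
    """Shape (channels, height, width) after one Conv + MaxPool2D branch."""
    height = conv_out(win_size, kernel_h)  # kernel slides over the window axis
    out_width = conv_out(width, width)     # kernel spans the full width, leaving 1
    return (channels, pool_out(height, pool_h), pool_out(out_width, 1))


win_size = 20  # example window size, chosen for illustration
branches = [
    branch_shape(win_size, kernel_h=3, pool_h=win_size - 3),  # Conv 1
    branch_shape(win_size, kernel_h=4, pool_h=win_size - 3),  # Conv 2
    branch_shape(win_size, kernel_h=5, pool_h=win_size - 4),  # Conv 3
]
concat_features = sum(c * h * w for c, h, w in branches)

print(branches)         # [(50, 1, 1), (50, 1, 1), (50, 1, 1)]
print(concat_features)  # 150
```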

LSTM

Layer	Parameters	Output
Input	[win_size * Embedding_size]	N/A
LSTM	Hidden_dim = 8	Embedding_size * 8
FC	[8 * 2]	2
Output	Softmax
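
A PyTorch module matching the LSTM table might look like the sketch below. It is not the implementation in the models folder. It assumes a single-layer, unidirectional LSTM whose hidden state at the last time step feeds the fully connected layer, and the class name LSTMClassifier is hypothetical.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """LSTM (hidden_dim=8) -> FC (8 -> 2) -> Softmax, as listed in the table."""

    def __init__(self, embedding_size, hidden_dim=8, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x has shape [batch, win_size, embedding_size].
        _, (h_n, _) = self.lstm(x)
        # h_n[-1] is the final hidden state, shape [batch, hidden_dim].
        logits = self.fc(h_n[-1])
        return torch.softmax(logits, dim=1)


# Example: a batch of 4 windows, each with 10 events of embedding size 16.
model = LSTMClassifier(embedding_size=16)
probabilities = model(torch.zeros(4, 10, 16))
print(tuple(probabilities.shape))  # (4, 2)
```

When training with nn.CrossEntropyLoss, the softmax is usually left out of forward because that loss expects raw logits.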

Acknowledgements

Our implementation is based on, or contains many references to, the following repositories:

Citing & Contacts

Please cite our work if you find it helpful to your research.

Wu, X., Li, H. & Khomh, F. On the effectiveness of log representation for log-based anomaly detection. Empir Software Eng 28, 137 (2023). https://doi.org/10.1007/s10664-023-10364-1

@article{article,
author = {Wu, X. and Li, H. and Khomh, F.},
year = {2023},
month = {10},
pages = {},
title = {On the effectiveness of log representation for log-based anomaly detection},
volume = {28},
journal = {Empirical Software Engineering},
doi = {10.1007/s10664-023-10364-1}
}
