AXA Telematics Challenge: Use of Autoencoders

Introduction:

This work investigates the performance of autoencoders for unsupervised anomaly detection, applying the algorithm to a publicly available dataset. Anomaly detection is one of the most popular topics in machine learning: its algorithms aim to find data points in a dataset that do not conform to an expected pattern. The idea is to identify data points whose behavior is unusual compared to points considered 'normal'.

Dataset:

The AXA Driver Telematics dataset was released to the public in the form of a Kaggle challenge. It contains the logs of 2736 drivers. There is one designated folder for each driver, containing 200 different trips as CSV files. Each trip is a single CSV file with two columns (the x and y coordinates) and a varying number of rows; each row gives the driver's position one second after the previous row. Trips are anonymized, such that each trip starts at (x, y) = (0, 0) and all the coordinates have been randomly flipped.
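As a minimal illustration, a single trip can be loaded and turned into per-second speeds as in the sketch below (the file path is hypothetical; pandas and NumPy are assumed):

```python
import numpy as np
import pandas as pd

# Load one anonymized trip: two columns (x, y), one row per second.
trip = pd.read_csv("drivers/1/1.csv")      # hypothetical path
xy = trip[["x", "y"]].to_numpy()

# The displacement between consecutive samples gives speed in units/second,
# since rows are exactly one second apart.
speed = np.linalg.norm(np.diff(xy, axis=0), axis=1)
print("trip length: %.1f, mean speed: %.2f" % (speed.sum(), speed.mean()))
```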

The key to finding the outliers is the fact that a varying and unknown number of trips generated by other drivers have been added to each driver's folder. The dataset is reasonably large, at 1.44 GB in compressed form. The reason for identifying anomalies in driving patterns is therefore to flag these injected trips, which do not belong to the designated driver.

Preparation of Data:

Features that describe both the driver's behavior and the road conditions need to be extracted. In this work I have used 12 features, listed below (a feature-extraction sketch follows the list). For every driver, a 200 × 12 feature matrix is fed into the input layer of the autoencoder. The training/test split is 70/30.

• Length of each trip
• Average velocity over each trip
• Percentile velocity (percentiles: 5, 10, 50, 75, 85, 95)
• Count of the number of stops within a certain distance threshold
• Count of the number of stops under a certain velocity threshold
• Ratio of the total number of stops to the length of a particular trip
• Acceleration
• Percentile of acceleration
• Heading angle
• Percentile of heading angle
• Turning aggression (speed × heading angle)
• Percentile of turning aggression
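A sketch of how one 12-feature row could be computed from a single trip. The stop thresholds and the single representative percentile are assumptions (the list above uses six percentiles per quantity), and the interpretation of the distance-threshold stop count is a guess:

```python
import numpy as np

def trip_features(xy, stop_speed=0.5, stop_dist=5.0, pct=85):
    """Compute one 12-feature row from a (T, 2) array of per-second positions."""
    step = np.diff(xy, axis=0)                         # per-second displacements
    speed = np.linalg.norm(step, axis=1)               # instantaneous velocity
    accel = np.diff(speed)                             # change in speed per second
    heading = np.abs(np.diff(np.arctan2(step[:, 1], step[:, 0])))  # heading change
    turn = speed[1:] * heading                         # turning aggression

    stops = speed < stop_speed                         # "stop" = speed under threshold
    return np.array([
        speed.sum(),                                   # 1. trip length
        speed.mean(),                                  # 2. average velocity
        np.percentile(speed, pct),                     # 3. percentile velocity
        (stops & (np.cumsum(speed) < stop_dist)).sum(),  # 4. stops within distance threshold
        stops.sum(),                                   # 5. stops under velocity threshold
        stops.sum() / max(speed.sum(), 1e-8),          # 6. stops / trip length
        accel.mean(),                                  # 7. acceleration
        np.percentile(accel, pct),                     # 8. percentile acceleration
        heading.mean(),                                # 9. heading angle
        np.percentile(heading, pct),                   # 10. percentile heading angle
        turn.mean(),                                   # 11. turning aggression
        np.percentile(turn, pct),                      # 12. percentile turning aggression
    ])
```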

Trip Segment Features:

Another idea is to identify common road segments between trips, which would help identify redundant trips. The Ramer-Douglas-Peucker algorithm would help in simplifying the trajectories first. However, due to time constraints, I have not been able to implement this yet.
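For reference, a minimal NumPy sketch of the Ramer-Douglas-Peucker simplification (not part of the repository; a standard recursive formulation):

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker: simplify a polyline, keeping points whose
    perpendicular distance from the start-end chord exceeds eps."""
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0:                                    # degenerate chord
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # perpendicular distance of each point from the start-end line
        dists = np.abs(np.cross(chord, points - start)) / norm
    idx = np.argmax(dists)
    if dists[idx] > eps:                             # split at the farthest point
        left = rdp(points[: idx + 1], eps)
        right = rdp(points[idx:], eps)
        return np.vstack([left[:-1], right])         # drop duplicated split point
    return np.vstack([start, end])
```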

Implementation:

Autoencoder:

An autoencoder is a class of neural network designed to perform dimensionality reduction and manifold learning, which helps to better represent unlabeled data. Although no labels are required, training proceeds as in supervised learning, with the input itself serving as the target. The simplest autoencoder consists of an input layer, a hidden layer, and an output layer. It builds on the intuition that the performance of a machine-learning algorithm depends on the features it is applied to.
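A minimal single-layer autoencoder in Theano, as a sketch: the hidden size and the tied weights are assumptions, and the repository's actual architecture may differ.

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
n_vis, n_hid = 12, 8                      # 12 input features; hidden size is an assumption
floatX = theano.config.floatX

W = theano.shared(rng.normal(0, 0.1, (n_vis, n_hid)).astype(floatX), name="W")
b_h = theano.shared(np.zeros(n_hid, dtype=floatX), name="b_h")
b_v = theano.shared(np.zeros(n_vis, dtype=floatX), name="b_v")
lr = theano.shared(np.asarray(0.1, dtype=floatX), name="lr")

x = T.matrix("x")
h = T.nnet.sigmoid(T.dot(x, W) + b_h)          # encoder
x_hat = T.nnet.sigmoid(T.dot(h, W.T) + b_v)    # decoder (tied weights)
mse = T.mean(T.sum((x - x_hat) ** 2, axis=1))  # reconstruction loss

params = [W, b_h, b_v]
updates = [(p, p - lr * T.grad(mse, p)) for p in params]
train = theano.function([x], mse, updates=updates)
encode = theano.function([x], h)
reconstruct = theano.function([x], x_hat)
```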

Stacked Autoencoder:

A stacked autoencoder consists of multiple autoencoder layers in which the output of each layer is wired to the input of the next. It represents a wider class of functions, which helps with dimensionality reduction. The layers are trained greedily, one at a time, as sketched below.
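A sketch of greedy layer-wise pretraining. Here `build_autoencoder` is a hypothetical helper that wraps the single-layer construction shown above and returns its `train` and `encode` functions; the layer sizes are assumptions.

```python
# Greedy layer-wise pretraining: each layer is an autoencoder trained on the
# codes produced by the previous one.
layer_sizes = [12, 8, 4]                  # assumed architecture
data = X_train                            # 200 x 12 feature matrix for one driver
for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    train_fn, encode_fn = build_autoencoder(n_in, n_hid)   # hypothetical helper
    for epoch in range(50):
        train_fn(data)
    data = encode_fn(data)                # feed the codes into the next layer
```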

Deep Learning Framework: Theano

Loss function: Mean Squared Error (MSE)
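Written out, the loss over a batch of N feature vectors x_i with reconstructions x̂_i is the standard formulation:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2
```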

Activation Function used:

(1) Sigmoid activation function: I generally use the sigmoid activation function, as it is easy to implement and Theano has a built-in function for it.

(2) ReLU activation function: this activation function is used when the number of hidden layers increases, as it helps mitigate the vanishing gradient problem.
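In Theano these look like the following (note: `T.nnet.relu` exists only in newer Theano releases, so `T.maximum` is the portable fallback):

```python
import theano.tensor as T

z = T.matrix("z")
sig = T.nnet.sigmoid(z)       # built-in sigmoid
relu = T.maximum(0.0, z)      # ReLU; newer Theano also offers T.nnet.relu
```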

Algorithm for training the autoencoder:

(1) Initialize the number of epochs and the learning rate
(2) Initialize the loss function
(3) Set epoch = 0
(4) While epoch < number of epochs:
(5)     epoch = epoch + 1
(6)     For minibatch = 1 to N:
(7)         Compute the MSE
(8)         Update the parameters with respect to the MSE using stochastic gradient descent
(9)     Update the learning rate and momentum
(10) Store the weights and biases for the test set
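A runnable version of this loop, reusing the `train` function and the shared `lr` from the autoencoder sketch above. `X_train` is assumed to be the driver's 200 × 12 feature matrix as a floatX array; momentum is omitted for brevity and the decay factor is an assumption.

```python
n_epochs, batch_size = 50, 20
for epoch in range(n_epochs):
    losses = []
    for start in range(0, len(X_train), batch_size):
        losses.append(train(X_train[start:start + batch_size]))       # steps (7)-(8)
    lr.set_value(np.asarray(lr.get_value() * 0.99, dtype=floatX))     # step (9): decay
    print("epoch %d  mean MSE %.4f" % (epoch + 1, float(np.mean(losses))))

# Step (10): keep the learned weights for evaluating the test set.
W_trained, bh_trained = W.get_value(), b_h.get_value()
```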

Evaluation of Results:

Testing the autoencoder model:

• Visualizing the weights, as is often done for MNIST, is not possible in this case. Due to the time constraint, it was also not possible to submit the results to Kaggle to determine the AUC.

• The L2 norm of the reconstruction error is computed for each trip of the training set; the same procedure is carried out for the test set.

• The L2 norms of the training and test sets are compared using the Kolmogorov-Smirnov (KS) test.

• The KS test is a good indicator of the relationship between the training and test distributions. If the null hypothesis is not rejected, the distributions may be the same, which indicates the algorithm may have generalized well; otherwise, over-fitting may have occurred. A sketch of this check follows the list.
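A sketch of the error computation and KS check using SciPy. `reconstruct` is the Theano function defined earlier, `X_test` is the held-out 30% split, and the 0.05 significance level is an assumption.

```python
from scipy.stats import ks_2samp

# Per-trip L2 reconstruction error for the training and test feature matrices.
err_train = np.linalg.norm(X_train - reconstruct(X_train), axis=1)
err_test = np.linalg.norm(X_test - reconstruct(X_test), axis=1)

stat, p_value = ks_2samp(err_train, err_test)
if p_value > 0.05:                 # cannot reject H0: distributions may be the same
    print("errors look similarly distributed; model may generalize well")
else:
    print("distributions differ; possible over-fitting")
```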

Anomaly Detection:

To detect anomalies:

• Train the model properly; the mean reconstruction error should decrease with the number of epochs.

• Plot the reconstruction error using matplotlib.

• For each driver there are 200 trips. Filter out approximately the 15 trips with the highest reconstruction error; this establishes a threshold.

• Use the same threshold to detect anomalies in the test set. Using the KS test, we have already established that our autoencoder generalizes well. A thresholding sketch follows.
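A minimal sketch of this thresholding rule; the choice of 15 trips follows the text, and `err_train`/`err_test` come from the evaluation sketch above.

```python
k = 15                                        # ~15 highest-error trips define the cut-off
threshold = np.sort(err_train)[-k]            # error of the k-th worst training trip
anomalies = np.where(err_test >= threshold)[0]
print("flagged %d of %d test trips as anomalous" % (len(anomalies), len(err_test)))
```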
