isneslab/mlsec-labs

ML Security Labs

Setup

We recommend Python 3.10, as compatibility with the Tesseract library is still being worked out.

I would also advise creating a Python virtual environment for these labs, using Python 3.10: see here for a guide on virtual environments.
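As a quick sanity check, the following minimal sketch can be run in a notebook cell to confirm that the active interpreter matches the recommendation above and lives inside a virtual environment. The function names are hypothetical helpers, not part of this repository, and the environment check only recognises `venv`-style environments.

```python
import sys

# Version recommended by this README.
RECOMMENDED = (3, 10)


def is_recommended_python(version_info=sys.version_info):
    """Return True if the interpreter's major.minor equals RECOMMENDED."""
    return tuple(version_info[:2]) == RECOMMENDED


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment.

    Inside a venv, sys.prefix points to the environment while
    sys.base_prefix points to the base installation. Conda environments
    are not detected by this check.
    """
    return sys.prefix != sys.base_prefix


print("Python version:", sys.version.split()[0])
print("Matches recommended 3.10:", is_recommended_python())
print("Inside a virtual environment:", in_virtualenv())
```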

Labs overview

This GitHub workspace contains some examples to get acquainted with the use of Machine Learning for Systems Security and Malware Detection.

  • Lab 01: Malware detection with Machine Learning. This lab is a warm-up that introduces the use of notebooks and the computation of the main performance metrics. Here, you will see how to embed some simple features into vector format, and how to perform classification tasks on datasets of increasing complexity.

  • Lab 02: Time-aware evaluations. This lab introduces time-aware evaluations, which measure how classifier performance decays over time, as well as mitigation strategies including active learning and classification with rejection. You will have to install the Tesseract library (see instructions below).

  • Lab 03: Adversarial Attacks. In this lab, you will learn how to generate "security evaluation curves" for adversarial attacks against a simple linear classifier. You will start with a simple weight-driven attack on the linear SVM classifier in the DREBIN feature space, which you may then compare with, and extend using, the PGD attack from the secml library.

  • Lab 04: The impact of sampling bias. More subtle aspects may also affect the reliability of classification. In this lab, you will evaluate how different subsampling strategies may lead to inflated performance. For example, the origin marketplace (e.g., "Google Play") of an app plays a role in detection accuracy. This lab is related to the Android malware experiment in the "Dos and Don'ts of Machine Learning in Computer Security" paper.
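To give a feel for what a "security evaluation curve" in Lab 03 looks like, here is a minimal pure-Python sketch under stated assumptions: features are binary, the attacker may only add features (switch 0 to 1), and the classifier is a linear model with a hypothetical weight vector `w` and bias `b`. The weights and samples are toy values, and the attack used in the lab notebook may differ in its details.

```python
def score(w, b, x):
    """Linear decision function: a positive score means 'malware'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def weight_driven_attack(w, x, budget):
    """Add up to `budget` absent features, most negative weight first.

    Only 0 -> 1 changes are made (assumption: adding features is the
    only allowed manipulation).
    """
    x_adv = list(x)
    # Absent features whose weight pushes the score towards 'goodware'.
    candidates = sorted(
        (i for i, xi in enumerate(x) if xi == 0 and w[i] < 0),
        key=lambda i: w[i],
    )
    for i in candidates[:budget]:
        x_adv[i] = 1
    return x_adv


def security_evaluation_curve(w, b, malware, budgets):
    """Fraction of malware samples still detected at each attack budget."""
    curve = []
    for k in budgets:
        detected = sum(
            score(w, b, weight_driven_attack(w, x, k)) > 0 for x in malware
        )
        curve.append(detected / len(malware))
    return curve


# Toy linear model and three malware samples (illustrative values only).
w = [2.0, 1.5, -1.0, -2.0, -0.5]
b = -0.5
malware = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
]
budgets = [0, 1, 2, 3]
curve = security_evaluation_curve(w, b, malware, budgets)
for k, rate in zip(budgets, curve):
    print(f"budget={k}: detection rate={rate:.2f}")
```

Plotting the detection rate against the budget gives the curve: the faster it drops, the less robust the classifier is to this attack.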

The datasets folder contains simple datasets and instructions for downloading a larger dataset based on the DREBIN (NDSS 2014) feature space.

Tesseract Library

In case you need to do time-aware evaluations with:

To install, create a Python 3.10 environment. If the instructions in the repo do not work, consider trying:

python -m build

To register the virtual environment on a Python notebook:

python -m ipykernel install --user --name <env-name>

where <env-name> is the name of the virtual environment.

Note: do NOT install from pip, because that is a different Tesseract library.

You can refer to this publication:

Android Malware Dataset (with DREBIN Feature Space)

These are preprocessed datasets that are already converted into feature matrices. They use the DREBIN feature abstraction.

After downloading the folder, you will have extended-features.zip. Unzipping it yields a folder extended-features with the following files:

  • X.json: the feature matrix (remember to drop the sha256 column before running classification tasks)
  • y.json: the labels (0 for goodware, 1 for malware)
  • meta.json: metadata, including timestamps
  • X-10k.p and f-10k.p: smaller versions with the top-10k features (f-10k.p is the vector of feature names)

You should put these files in the datasets/drebin/ folder in this repository. You can rename the extended-features folder to drebin.

The Python functions for loading the dataset are in libs.utils.
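For orientation only, the sketch below shows how the three JSON files relate to each other, using tiny in-memory stand-ins rather than the real files. It assumes that each entry of X.json is a dictionary containing a `sha256` key plus feature entries, and that each entry of meta.json carries a date under a hypothetical `timestamp` key. The real field names may differ, so use the loaders in libs.utils for actual experiments.

```python
from datetime import datetime


def drop_sha256(records):
    """Return copies of the feature dicts without the 'sha256' identifier."""
    return [{k: v for k, v in r.items() if k != "sha256"} for r in records]


def time_aware_split(X, y, meta, split_date, key="timestamp"):
    """Put apps dated before `split_date` in train, the rest in test.

    `key` is a hypothetical metadata field name (an assumption).
    """
    train, test = [], []
    for features, label, info in zip(X, y, meta):
        when = datetime.fromisoformat(info[key])
        (train if when < split_date else test).append((features, label))
    return train, test


# In-memory stand-ins for X.json, y.json and meta.json (made-up values).
X_raw = [
    {"sha256": "aaa", "feature_a": 1},
    {"sha256": "bbb", "feature_b": 1},
    {"sha256": "ccc", "feature_a": 1, "feature_c": 1},
]
y = [0, 1, 1]  # 0 = goodware, 1 = malware
meta = [
    {"timestamp": "2014-03-01T00:00:00"},
    {"timestamp": "2014-11-15T00:00:00"},
    {"timestamp": "2016-07-20T00:00:00"},
]

X = drop_sha256(X_raw)
train, test = time_aware_split(X, y, meta, datetime(2015, 1, 1))
print("train size:", len(train), "test size:", len(test))
```

Splitting by date rather than at random keeps every training app older than every test app, which is the setting the time-aware evaluations in Lab 02 are concerned with.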

The features correspond to the apps from 2014–2018 used in this paper:

To understand the meaning of the features, read the feature extraction process here:

About

Exercises for practicing MLSec for Systems Security
