Enhancing Network Intrusion Detection with Feature Selection

INCS 870, Spring 2024, Team 06, supervised by Dr. Zhida Li

Overview

This project involves the development of a machine learning pipeline for network intrusion detection. The primary goal is to classify network activities into normal or attack categories, using the UNSW-NB15 dataset. The pipeline includes data preprocessing, feature engineering, model training, hyperparameter tuning, and model evaluation.

Architecture

Our overall architecture is illustrated below:

[Figure: Architecture]

Prerequisites

Before running this project, ensure you have the following installed:

  • Python 3.12
  • Libraries: pandas, numpy, matplotlib, scikit-learn, xgboost

To install the required libraries, run:

pip install -r requirements.txt

Dataset

The dataset used is UNSW-NB15, which can be downloaded from https://research.unsw.edu.au/projects/unsw-nb15-dataset and should be placed under unsw_nb15/. The dataset includes various features related to network traffic and a label indicating normal or attack activity.

File Structure

  • train_eval.py: Main script containing the entire machine learning pipeline, including training and evaluation.
  • unsw_nb15/: Directory containing the UNSW_NB15_*.csv files. Not included for copyright reasons.
  • figures/: Directory containing visualizations generated by the pipeline.
  • models/: Directory containing trained models.
  • reports/: Directory containing model evaluation reports.
  • requirements.txt: List of required Python libraries.
  • README.md: Project overview and usage instructions.
  • .gitignore: Files and directories to be ignored by Git.

Features

The project includes the following key components:

Usage

To run the project:

  1. Place the dataset files in the unsw_nb15/ directory.
  2. Run the train_eval.py script: python train_eval.py.
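Because the dataset is not shipped with the repository, it can help to confirm that the files are in place before starting a run. The following is a minimal sketch, not part of the project itself: the helper name `find_dataset_files` is hypothetical, and it only assumes the `unsw_nb15/` directory and the `UNSW_NB15_*.csv` naming described under File Structure.

```python
from pathlib import Path


def find_dataset_files(data_dir="unsw_nb15"):
    """Return the sorted UNSW_NB15_*.csv paths found under data_dir.

    Hypothetical helper for illustration; raises FileNotFoundError
    when no matching CSV file is present.
    """
    files = sorted(Path(data_dir).glob("UNSW_NB15_*.csv"))
    if not files:
        raise FileNotFoundError(
            f"No UNSW_NB15_*.csv files found in {data_dir!r}; "
            "download the dataset and place it there first."
        )
    return files
```

Calling `find_dataset_files()` from the repository root before `python train_eval.py` would surface a missing dataset immediately.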

train_eval.py

This script contains the entire machine learning pipeline.
It accepts the following optional command-line arguments:

  1. pca n_components: Apply PCA with n_components. Default: None.
  2. method: Choose the method for feature selection. Options: 'rfe', 'rfecv', 'variance_threshold', 'chi2', 'anova', 'mutual_information'. Default: None.
  3. k: Argument to be passed to the feature selection method. For example, k for k-best.
  4. task: Specify which classification task to run. Options: 'binary', 'multi'. Default: 'multi'.
  5. model_path: Path to load a saved model. The program will not train a new model if this argument is provided.
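To make one of the listed feature selection options concrete, here is a pure-Python sketch of the idea behind 'variance_threshold': features whose variance does not exceed a threshold carry little information and are dropped. This is an illustration only. The project lists scikit-learn as a dependency and presumably uses its implementation; the function names and toy data below are assumptions.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose the row-major data into columns
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]


def select_columns(rows, keep):
    """Keep only the columns at the given indices."""
    return [[row[i] for i in keep] for row in rows]


# Toy data (illustrative, not taken from UNSW-NB15):
# column 0 is constant, columns 1 and 2 vary.
rows = [
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
]
kept = variance_threshold(rows, threshold=0.1)
reduced = select_columns(rows, kept)
print(kept)  # the constant column 0 is dropped
```

With this method, the k argument would most naturally carry the threshold value rather than a feature count, though how train_eval.py interprets it is not stated here.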

run_model_batch.ps1

This is an auxiliary script that runs the train_eval.py script with different combinations of feature selection methods. It is useful for running multiple experiments in batch mode.
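The batch idea can be sketched in Python as well. The example below is not a translation of run_model_batch.ps1, whose contents are not shown here; it simply builds one train_eval.py command per combination of feature selection method and task, using the key=value argument form from the sample command. The fixed `k=10` and the function name `build_commands` are assumptions for illustration, and in practice each method may need its own k.

```python
from itertools import product

# Option values as listed for train_eval.py.
METHODS = ["rfe", "rfecv", "variance_threshold", "chi2", "anova", "mutual_information"]
TASKS = ["binary", "multi"]


def build_commands(methods=METHODS, tasks=TASKS, k=10):
    """Build one argument list per (method, task) combination."""
    return [
        ["python", "train_eval.py", f"method={m}", f"k={k}", f"task={t}"]
        for m, t in product(methods, tasks)
    ]


commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
    # To actually launch each experiment, something like
    # subprocess.run(cmd, check=True) could be used here.
```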

visualize.py

Once run, this script generates plots for the training dataset and places them in the figures/dataset_plots directory. Plots that can be generated include:

  1. Correlation matrix heatmap
  2. Boxplots
  3. Countplots
  4. Pairplots
  5. Scatterplots
  6. Histograms

plot_summary.py

This script reads reports from the reports/ directory and generates visualizations for model evaluation.

Customization

train_eval.py can be customized through the command-line arguments described above. A sample command looks like this:

python train_eval.py method=rfe k=10 task=binary
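For readers wondering how key=value arguments like these might be handled, the following is a minimal sketch of a parser with the documented options and defaults. It is not the code in train_eval.py: the function name `parse_options`, the error handling, and the choice to leave k as a string (since its meaning depends on the method) are all assumptions.

```python
METHODS = {"rfe", "rfecv", "variance_threshold", "chi2", "anova", "mutual_information"}
TASKS = {"binary", "multi"}


def parse_options(tokens):
    """Parse key=value tokens into an options dict (illustrative only)."""
    # Defaults follow the documented ones; the defaults for k and
    # model_path are not documented and are assumed to be None.
    options = {"pca": None, "method": None, "k": None, "task": "multi", "model_path": None}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or key not in options:
            raise ValueError(f"Unrecognized argument: {token!r}")
        options[key] = value
    if options["method"] is not None and options["method"] not in METHODS:
        raise ValueError(f"Unknown feature selection method: {options['method']!r}")
    if options["task"] not in TASKS:
        raise ValueError(f"Unknown task: {options['task']!r}")
    if options["pca"] is not None:
        options["pca"] = int(options["pca"])  # n_components is a component count
    return options


opts = parse_options(["method=rfe", "k=10", "task=binary"])
print(opts)
```

In a script, the tokens would typically come from `sys.argv[1:]`.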

Visualization

Data visualization is done through the plot_summary.py script, which reads reports from the reports/ directory and generates visualizations for model evaluation. The output depends on the model parameters and feature selection methods used; an example is shown below:

[Figure: Visualization]

Contributions

Junyi Dong: Data preprocessing, feature engineering, model training, hyperparameter tuning, model evaluation, documentation. TODO: Add other team members

License

No license