INCS 870, Spring 2024, Team 06, supervised by Dr. Zhida Li
This project develops a machine learning pipeline for network intrusion detection. The primary goal is to classify network activity as normal or attack using the UNSW-NB15 dataset. The pipeline covers data preprocessing, feature engineering, model training, hyperparameter tuning, and model evaluation.
Our overall architecture is illustrated below:
Before running this project, ensure you have the following installed:
- Python 3.12
- Libraries: pandas, numpy, matplotlib, scikit-learn, xgboost
To install the required libraries, run:
pip install -r requirements.txt
The dataset used is UNSW-NB15, which can be downloaded from https://research.unsw.edu.au/projects/unsw-nb15-dataset and should be placed under unsw_nb15/. It includes various features describing network traffic and a label indicating normal or attack activity.
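For reference, loading the CSV parts with pandas might look like the sketch below. The `header=None` assumption and the glob pattern are illustrative only; train_eval.py may handle the files differently.

```python
# Minimal sketch: load the UNSW-NB15 CSV parts from unsw_nb15/ into one DataFrame.
# Assumes the raw parts ship without a header row; the actual preprocessing in
# train_eval.py may differ (header handling, dtypes, train/test split files).
from pathlib import Path

import pandas as pd

DATA_DIR = Path("unsw_nb15")

frames = [
    pd.read_csv(path, header=None, low_memory=False)  # low_memory=False avoids mixed-dtype warnings
    for path in sorted(DATA_DIR.glob("UNSW_NB15_*.csv"))
]
df = pd.concat(frames, ignore_index=True)
print(df.shape)
```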
- train_eval.py: Main script containing the entire machine learning pipeline, including training and evaluation.
- unsw_nb15/: Directory containing the UNSW_NB15_*.csv files. Not included for copyright reasons.
- figures/: Directory containing visualizations generated by the pipeline.
- models/: Directory containing trained models.
- reports/: Directory containing model evaluation reports.
- requirements.txt: List of required Python libraries.
- README.md: Project overview and usage instructions.
- .gitignore: Files and directories to be ignored by Git.
To run the project:
- Place the dataset files in the unsw_nb15/ directory.
- Run the train_eval.py script: python train_eval.py
The project includes the following key components:
The train_eval.py script contains the entire machine learning pipeline.
It accepts the following optional command-line arguments:
- pca: Apply PCA with the given number of components (n_components). Default: None.
- method: Feature selection method to apply. Options: 'rfe', 'rfecv', 'variance_threshold', 'chi2', 'anova', 'mutual_information' (see the sketch after this list). Default: None.
- k: Argument passed to the feature selection method, e.g. the number of features to keep for k-best selection.
- task: Specify which classification task to run. Options: 'binary', 'multi'. Default: 'multi'.
- model_path: Path to load a saved model. The program will not train a new model if this argument is provided.
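For orientation, the method options above map naturally onto scikit-learn transformers. The sketch below is only an illustration under stated assumptions: the `build_selector` helper, the RandomForestClassifier estimator for RFE/RFECV, and the variance threshold value are not taken from train_eval.py.

```python
# Hedged sketch of how the feature-selection options could map onto scikit-learn;
# the project's real implementation may differ in estimator choice and defaults.
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE,
    RFECV,
    SelectKBest,
    VarianceThreshold,
    chi2,
    f_classif,
    mutual_info_classif,
)


def build_selector(method, k=10):
    """Return a scikit-learn transformer for the chosen feature-selection method."""
    # Estimator used by RFE/RFECV; an assumption, not necessarily the project's choice.
    estimator = RandomForestClassifier(n_estimators=100, random_state=42)
    selectors = {
        "rfe": RFE(estimator, n_features_to_select=k),
        "rfecv": RFECV(estimator, min_features_to_select=k),
        "variance_threshold": VarianceThreshold(threshold=0.0),
        "chi2": SelectKBest(chi2, k=k),  # requires non-negative features
        "anova": SelectKBest(f_classif, k=k),
        "mutual_information": SelectKBest(mutual_info_classif, k=k),
    }
    return selectors[method]


# PCA is configured separately via the pca argument:
pca = PCA(n_components=20)  # 20 is an example value
```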
An auxiliary script runs train_eval.py with different combinations of feature selection methods, which is useful for running multiple experiments in batch mode.
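A minimal sketch of the batch idea is shown below, assuming the key=value argument style used in the sample command later in this README; the actual auxiliary script may iterate over different or additional combinations.

```python
# Sketch of a batch runner: invoke train_eval.py once per feature-selection method.
import subprocess
import sys

METHODS = ["rfe", "rfecv", "variance_threshold", "chi2", "anova", "mutual_information"]

for method in METHODS:
    cmd = [sys.executable, "train_eval.py", f"method={method}", "k=10", "task=binary"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```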
A dataset visualization script generates plots for the training dataset and places them in the figures/dataset_plots directory. Plots that can be generated include (see the sketch after this list):
- Correlation matrix heatmap
- Boxplots
- Countplots
- Pairplots
- Scatterplots
- Histograms
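As referenced above, here is a hedged sketch of two of these plots (correlation matrix heatmap and histograms) using only pandas and matplotlib from the requirements. The file name UNSW_NB15_training-set.csv and the styling are assumptions, not necessarily what the project's plotting script does.

```python
# Sketch: correlation heatmap and histograms saved under figures/dataset_plots.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

OUT_DIR = Path("figures/dataset_plots")
OUT_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_csv("unsw_nb15/UNSW_NB15_training-set.csv")  # assumed file name
numeric = df.select_dtypes(include="number")

# Correlation matrix heatmap
corr = numeric.corr()
fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns, fontsize=6)
fig.colorbar(im, ax=ax)
fig.tight_layout()
fig.savefig(OUT_DIR / "correlation_heatmap.png", dpi=150)

# Histograms of the numeric features
axes = numeric.hist(figsize=(16, 12), bins=30)
hist_fig = axes.flat[0].figure
hist_fig.tight_layout()
hist_fig.savefig(OUT_DIR / "histograms.png", dpi=150)
```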
The plot_summary.py script reads reports from the reports/ directory and generates visualizations for model evaluation.
The program is highly customizable and accepts the arguments specified above. A sample command looks like this:
python train_eval.py method=rfe k=10 task=binary
Data visualization is done through the plot_summary.py script described above. The exact output differs depending on the model parameters and feature selection methods used.
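As a rough idea of how such a summary could be produced, here is a hedged sketch; it assumes each report in reports/ is a JSON file with an "accuracy" key, which may not match the project's actual report format.

```python
# Sketch in the spirit of plot_summary.py: read per-experiment reports and
# plot a bar chart of accuracies. The JSON layout is an assumption.
import json
from pathlib import Path

import matplotlib.pyplot as plt

REPORTS_DIR = Path("reports")
OUT_DIR = Path("figures")
OUT_DIR.mkdir(exist_ok=True)

names, accuracies = [], []
for path in sorted(REPORTS_DIR.glob("*.json")):
    with path.open() as fh:
        report = json.load(fh)
    names.append(path.stem)
    accuracies.append(report["accuracy"])

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(names, accuracies)
ax.set_ylabel("Accuracy")
ax.set_title("Model accuracy by experiment")
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
fig.savefig(OUT_DIR / "model_accuracy_summary.png", dpi=150)
```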
Junyi Dong: Data preprocessing, feature engineering, model training, hyperparameter tuning, model evaluation, documentation. TODO: Add other team members
No license specified.