INCS 870, Spring 2024, Team 06, supervised by Dr. Zhida Li
This project develops a machine learning pipeline for network intrusion detection. The primary goal is to classify network activity as normal or attack using the UNSW-NB15 dataset. The pipeline covers data preprocessing, feature engineering, model training, hyperparameter tuning, and model evaluation.
Our overall architecture is illustrated below:
Before running this project, ensure you have the following installed:
- Python 3.12
- Libraries: pandas, numpy, matplotlib, scikit-learn, xgboost
To install the required libraries, run:
pip install -r requirements.txt
The dataset used is UNSW-NB15, which can be downloaded from https://research.unsw.edu.au/projects/unsw-nb15-dataset and should be placed under unsw_nb15/. It includes various features describing network traffic and a label indicating normal or attack activity.
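For reference, loading the CSV parts with pandas might look like the sketch below. The `header=None` assumption and the glob pattern are illustrative only; train_eval.py may handle the files differently.

```python
# Minimal sketch: load the UNSW-NB15 CSV parts from unsw_nb15/ into one DataFrame.
# Assumes the raw parts ship without a header row; the actual preprocessing in
# train_eval.py may differ (header handling, dtypes, train/test split files).
from pathlib import Path

import pandas as pd

DATA_DIR = Path("unsw_nb15")

frames = [
    pd.read_csv(path, header=None, low_memory=False)  # low_memory=False avoids mixed-dtype warnings
    for path in sorted(DATA_DIR.glob("UNSW_NB15_*.csv"))
]
df = pd.concat(frames, ignore_index=True)
print(df.shape)
```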
- train_eval.py: Main script containing the entire machine learning pipeline, including training and evaluation.
- unsw_nb15/: Directory containing the UNSW_NB15_*.csv files. Not included for copyright reasons.
- figures/: Directory containing visualizations generated by the pipeline.
- models/: Directory containing trained models.
- reports/: Directory containing model evaluation reports.
- requirements.txt: List of required Python libraries.
- README.md: Project overview and usage instructions.
- .gitignore: Files and directories to be ignored by Git.
To run the project:
- Place the dataset files in the unsw_nb15/ directory.
- Run the train_eval.py script: python train_eval.py
The project includes the following key components:
The train_eval.py script contains the entire machine learning pipeline.
It accepts the following optional command-line arguments:
- pca: Apply PCA with the given number of components (n_components). Default: None.
- method: Feature selection method to apply. Options: 'rfe', 'rfecv', 'variance_threshold', 'chi2', 'anova', 'mutual_information' (see the sketch after this list). Default: None.
- k: Argument passed to the feature selection method, e.g. the number of features to keep for k-best selection.
- task: Specify which classification task to run. Options: 'binary', 'multi'. Default: 'multi'.
- model_path: Path to load a saved model. The program will not train a new model if this argument is provided.
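For orientation, the method options above map naturally onto scikit-learn transformers. The sketch below is only an illustration under stated assumptions: the `build_selector` helper, the RandomForestClassifier estimator for RFE/RFECV, and the variance threshold value are not taken from train_eval.py.

```python
# Hedged sketch of how the feature-selection options could map onto scikit-learn;
# the project's real implementation may differ in estimator choice and defaults.
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE,
    RFECV,
    SelectKBest,
    VarianceThreshold,
    chi2,
    f_classif,
    mutual_info_classif,
)


def build_selector(method, k=10):
    """Return a scikit-learn transformer for the chosen feature-selection method."""
    # Estimator used by RFE/RFECV; an assumption, not necessarily the project's choice.
    estimator = RandomForestClassifier(n_estimators=100, random_state=42)
    selectors = {
        "rfe": RFE(estimator, n_features_to_select=k),
        "rfecv": RFECV(estimator, min_features_to_select=k),
        "variance_threshold": VarianceThreshold(threshold=0.0),
        "chi2": SelectKBest(chi2, k=k),  # requires non-negative features
        "anova": SelectKBest(f_classif, k=k),
        "mutual_information": SelectKBest(mutual_info_classif, k=k),
    }
    return selectors[method]


# PCA is configured separately via the pca argument:
pca = PCA(n_components=20)  # 20 is an example value
```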
An auxiliary script runs train_eval.py with different combinations of feature selection methods, which is useful for running multiple experiments in batch mode.
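A minimal sketch of the batch idea is shown below, assuming the key=value argument style used in the sample command later in this README; the actual auxiliary script may iterate over different or additional combinations.

```python
# Sketch of a batch runner: invoke train_eval.py once per feature-selection method.
import subprocess
import sys

METHODS = ["rfe", "rfecv", "variance_threshold", "chi2", "anova", "mutual_information"]

for method in METHODS:
    cmd = [sys.executable, "train_eval.py", f"method={method}", "k=10", "task=binary"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```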
A dataset visualization script generates plots for the training dataset and places them in the figures/dataset_plots directory. Plots that can be generated include (see the sketch after this list):
- Correlation matrix heatmap
- Boxplots
- Countplots
- Pairplots
- Scatterplots
- Histograms
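As referenced above, here is a hedged sketch of two of these plots (correlation matrix heatmap and histograms) using only pandas and matplotlib from the requirements. The file name UNSW_NB15_training-set.csv and the styling are assumptions, not necessarily what the project's plotting script does.

```python
# Sketch: correlation heatmap and histograms saved under figures/dataset_plots.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

OUT_DIR = Path("figures/dataset_plots")
OUT_DIR.mkdir(parents=True, exist_ok=True)

df = pd.read_csv("unsw_nb15/UNSW_NB15_training-set.csv")  # assumed file name
numeric = df.select_dtypes(include="number")

# Correlation matrix heatmap
corr = numeric.corr()
fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns, fontsize=6)
fig.colorbar(im, ax=ax)
fig.tight_layout()
fig.savefig(OUT_DIR / "correlation_heatmap.png", dpi=150)

# Histograms of the numeric features
axes = numeric.hist(figsize=(16, 12), bins=30)
hist_fig = axes.flat[0].figure
hist_fig.tight_layout()
hist_fig.savefig(OUT_DIR / "histograms.png", dpi=150)
```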
The plot_summary.py script reads reports from the reports/ directory and generates visualizations for model evaluation.
The program is highly customizable and accepts the arguments specified above. A sample command looks like this:
python train_eval.py method=rfe k=10 task=binary
Data visualization is done through the plot_summary.py script described above. The exact output differs depending on the model parameters and feature selection methods used.
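As a rough idea of how such a summary could be produced, here is a hedged sketch; it assumes each report in reports/ is a JSON file with an "accuracy" key, which may not match the project's actual report format.

```python
# Sketch in the spirit of plot_summary.py: read per-experiment reports and
# plot a bar chart of accuracies. The JSON layout is an assumption.
import json
from pathlib import Path

import matplotlib.pyplot as plt

REPORTS_DIR = Path("reports")
OUT_DIR = Path("figures")
OUT_DIR.mkdir(exist_ok=True)

names, accuracies = [], []
for path in sorted(REPORTS_DIR.glob("*.json")):
    with path.open() as fh:
        report = json.load(fh)
    names.append(path.stem)
    accuracies.append(report["accuracy"])

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(names, accuracies)
ax.set_ylabel("Accuracy")
ax.set_title("Model accuracy by experiment")
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
fig.savefig(OUT_DIR / "model_accuracy_summary.png", dpi=150)
```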
Junyi Dong: Data preprocessing, feature engineering, model training, hyperparameter tuning, model evaluation, documentation. TODO: Add other team members
No license specified.