This pipeline assumes that the first three phases of MegaVul's pipeline have already been run. Those phases collect the NVD data, filter it by language and CWE, and extract commits from 28 different git hosting platforms. See https://github.com/icyrockton/megavul for documentation, source code, and usage. Note: this pipeline is only usable with access to the storage created by MegaVul. You MUST ensure that a directory containing the MegaVul storage structure exists before running this project!
The pipeline comes with an all-in-one script that runs it end to end. The final product is a directory containing the entire dataset, split temporally into three separate files: a training set, a validation set, and a test set.
Create a virtual environment
python -m venv .venv
Activate the environment
Windows PowerShell:
./.venv/Scripts/activate.ps1
Unix:
source .venv/bin/activate
Install the required packages
pip install -r requirements.txt
Run the pipeline initializer
python run_pipeline.py
This section briefly explains how to reproduce the models described in the thesis.
StarCoder2-3b is a decoder-only model fine-tuned with LoRA via Unsloth. The model is loaded with 4-bit quantization, and LoRA adapters are applied to all attention and MLP projection layers (q/k/v/o/gate/up/down). Training uses SFT (Supervised Fine-Tuning) with a structured prompt: a task instruction, the truncated source code, and the label suffix. The model learns to generate "VULN" or "BENIGN". At inference, the model generates up to 16 tokens, and the output is parsed with a regex to extract the label. Default hyperparameters:
- learning rate 2e-4
- per-device batch size 2
- gradient accumulation steps 8
- effective batch size 16 (2 x 8)
- 2 epochs
- cosine scheduler
- adamw_8bit optimizer
- max sequence length 1024
Train:
python finetune.py train --task binary --data balanced --adapter-dir ./adapters/balanced-binary-lora
Eval:
python finetune.py eval --task binary --adapter-dir ./adapters/balanced-binary-lora --test-ratio 27
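The prompt structure and label parsing described above can be sketched as follows. This is a minimal illustration, not the code in finetune.py: the instruction wording, the section markers, the character-based truncation, and the function names `build_prompt` and `parse_label` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical instruction text; the real template lives in finetune.py.
INSTRUCTION = "Classify the following C/C++ function as VULN or BENIGN."
LABEL_PATTERN = re.compile(r"\b(VULN|BENIGN)\b")


def build_prompt(code: str, max_chars: int = 2000) -> str:
    """Assemble task instruction, truncated source code, and label suffix.

    Truncating by characters is an assumption; a real run would more
    likely truncate by tokens to respect the max sequence length.
    """
    return f"{INSTRUCTION}\n\n### Code:\n{code[:max_chars]}\n\n### Label:"


def parse_label(generated: str) -> Optional[str]:
    """Return the first label found in the generated text, or None."""
    match = LABEL_PATTERN.search(generated)
    return match.group(1) if match else None
```

Returning `None` for unparseable output keeps malformed generations visible, so they can be counted separately instead of being silently mapped to one class.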
CodeBERT (microsoft/codebert-base) is an encoder model with a single-logit classification head, trained with sigmoid BCE loss. The default is full fine-tuning (no LoRA, since the model is small), matching the PrimeVul paper setup. The raw source code is tokenized directly, without a prompt template. The model outputs a vulnerability probability via a sigmoid, which is thresholded at 0.5. Hyperparameters follow the PrimeVul experiments:
- 10 epochs
- lr 2e-5
- batch 64
- 1000 warm up steps
- linear scheduler
- max_grad_norm 1
- optional class weighting and LoRA modes are available via flags but are not used in the default runs presented in the thesis
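To make the single-logit setup concrete, the sketch below computes the sigmoid, the binary cross-entropy for one sample, and the thresholded prediction in plain Python. The function names are illustrative, and treating a probability of exactly 0.5 as positive is an assumption. This naive form can overflow for very large negative logits; an actual training run would normally rely on a numerically stable implementation such as `torch.nn.BCEWithLogitsLoss`.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_with_logit(logit: float, label: int) -> float:
    """Binary cross-entropy for one sample, computed from the raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def predict(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) if the probability reaches the threshold."""
    return int(sigmoid(logit) >= threshold)
```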
Despite the use of Docker, not every machine can run this. This fine-tuning requires an NVIDIA GPU with CUDA 11.2 support. Using the image below is strongly recommended: torch and CUDA come preinstalled on it, so neither has to be installed separately.
docker pull jonasdaderman/codebert-thesis:latest
python finetune.py train --dataset primevul --task binary --data full --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul --epochs 10 --learning-rate 2e-5 --batch-size 64 --warmup-steps 1000
python finetune.py eval --dataset primevul --task binary --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul
To fully replicate the material, use the PowerShell scripts provided in the directory.
Unix users can translate these to bash, for example with Claude or similar tools.
Training: ./run_all_finetunings.ps1
Evaluation: ./run_all_eval.ps1
The first step of this pipeline extracts functions from the downloaded commits. It uses tree-sitter to parse each file into a syntax tree, identify C/C++ function declarators, and extract the function contents.
As you may have noticed when running the MegaVul pipeline, it requests a large number of resources from git.kernel.org, which was blocked in the case of this thesis. To obtain these vulnerabilities, the GitHub mirror of the Linux kernel was accessed instead. The thesis describes how this is done.
Some commits in the MegaVul commit cache have one side (fix or parent) but not the other. This caused many vulnerabilities to be silently dropped. To fix this, this step audits the cache for asymmetric pairs, fetches the missing side, and re-extracts the functions for the affected commits.
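The audit for asymmetric pairs can be sketched as a set membership check. This is an illustration under assumptions: the cache is modeled as a plain set of commit hashes, and `find_asymmetric_pairs` is a hypothetical name, not a function from the pipeline.

```python
from typing import Iterable, List, Set, Tuple


def find_asymmetric_pairs(
    pairs: Iterable[Tuple[str, str]], cached: Set[str]
) -> List[Tuple[str, str]]:
    """Return (commit_to_fetch, cached_counterpart) for half-cached pairs.

    `pairs` holds (fix_sha, parent_sha) tuples. Pairs with both sides
    cached, or with neither side cached, are ignored.
    """
    to_fetch = []
    for fix_sha, parent_sha in pairs:
        has_fix = fix_sha in cached
        has_parent = parent_sha in cached
        if has_fix != has_parent:  # exactly one side is present
            missing = parent_sha if has_fix else fix_sha
            present = fix_sha if has_fix else parent_sha
            to_fetch.append((missing, present))
    return to_fetch
```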
This step folds the recovered rows from the kernel and partial downloads back into the main set of vulnerable functions. It replaces the rows for commits touched by the recovery and deduplicates the dataset using "hash_dedup_check.py".
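A minimal sketch of the fold-and-deduplicate step is shown below. The row fields `commit` and `func`, the helper names, and the choice of a SHA-256 hash over whitespace-stripped code are assumptions for illustration; the actual logic in hash_dedup_check.py may differ.

```python
import hashlib
from typing import Dict, List


def normalized_hash(code: str) -> str:
    """Hash the function body with all whitespace removed."""
    return hashlib.sha256("".join(code.split()).encode("utf-8")).hexdigest()


def fold_recovered(main_rows: List[Dict], recovered_rows: List[Dict]) -> List[Dict]:
    """Replace rows of recovered commits, then drop duplicate functions."""
    touched = {row["commit"] for row in recovered_rows}
    merged = [r for r in main_rows if r["commit"] not in touched]
    merged.extend(recovered_rows)

    seen = set()
    unique = []
    for row in merged:
        key = normalized_hash(row["func"])
        if key not in seen:  # keep only the first occurrence
            seen.add(key)
            unique.append(row)
    return unique
```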
This is the contribution of the thesis in terms of data pipelining. The program evaluates each data sample and assigns each group to one of five labels along two axes: how many functions changed in the commit (single vs. multiple) and how many of those are mentioned by name in the CVE description (none, one, or multiple). Before grouping, a trivial-diff prefilter strips rows where the only changes are whitespace or comments, so they don't inflate the function count.
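The trivial-diff prefilter can be approximated by comparing both versions of a function after removing comments and whitespace. The sketch below is a simplification under stated assumptions: the regex does not understand string literals that contain comment markers, removing all whitespace also merges adjacent tokens, and the names `normalize` and `is_trivial_diff` are illustrative.

```python
import re

# Matches // line comments and /* block comments */ in C/C++ source.
_COMMENTS = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def normalize(code: str) -> str:
    """Drop C/C++ comments, then remove all whitespace."""
    return "".join(_COMMENTS.sub(" ", code).split())


def is_trivial_diff(before: str, after: str) -> bool:
    """True if the two versions differ only in whitespace or comments."""
    return normalize(before) == normalize(after)
```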
For the hardest category, MULTI_FUNCTION_MULTI_MENTION (MFMM), where multiple function names appear in the description and multiple functions are touched across commits, heuristics are applied to identify the truly vulnerable function. Five heuristics are applied, as described in the thesis. This step recovers a fraction of the rows (~160).
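The two axes of the labeling scheme can be computed as in the sketch below. Only the name MULTI_FUNCTION_MULTI_MENTION comes from this document; the other axis values are named by analogy and are assumptions, as are the function names and the whole-word matching rule. The mapping from axis pairs to the five final labels is defined in the thesis and is not reproduced here.

```python
import re
from typing import Iterable, Tuple


def count_mentions(description: str, function_names: Iterable[str]) -> int:
    """Count distinct function names that appear as whole words."""
    return sum(
        1
        for name in set(function_names)
        if re.search(rf"\b{re.escape(name)}\b", description)
    )


def classify_axes(description: str, changed_functions: Iterable[str]) -> Tuple[str, str]:
    """Return (function_axis, mention_axis) for one commit group."""
    names = set(changed_functions)
    if not names:
        raise ValueError("expected at least one changed function")
    function_axis = "SINGLE_FUNCTION" if len(names) == 1 else "MULTI_FUNCTION"
    mentions = count_mentions(description, names)
    if mentions == 0:
        mention_axis = "NO_MENTION"
    elif mentions == 1:
        mention_axis = "SINGLE_MENTION"
    else:
        mention_axis = "MULTI_MENTION"
    return function_axis, mention_axis
```

Matching on word boundaries avoids counting `parse` as a mention when the description only names `parse_header`.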
To maintain sanity while working with the data, several checks and validations are performed. Some of them are redundant but validate previous steps (e.g. deduplication is performed twice). C1 and C2 are deduplication and whitespace stripping. C3-C8 are new checks that filter the data.
- C3 - Drop functions shorter than 50 characters, as they are too short.
- C4 - Sanity check: remove any trivial diffs.
- C5 - Remove test/example files, e.g. unit and integration tests.
- C6a - The NVD sometimes points several CVEs to the same commit, so the same vulnerable piece of code is duplicated under different CVEs. Without this rule, one fix commit referenced by three CVEs would contribute three copies of the same functions, each counted as unique because of its CVE. E.g. CVE-2023-001 through 003 all reference the same commit. This rule collapses them into one.
- C6b - Sometimes a single CVE is fixed across multiple commits that touch the same function. This rule ensures that only the final version is kept. E.g. CVE-2022-555 is fixed on the main branch but also references a commit on the stable branch. Both are downloaded because they contain the code, but they are identical. One commit is removed.
- C7 - Targets CVEs that have multiple commits in the dataset. If some of those commits contain functions that are mentioned in the CVE description and others don't, the unmentioned commits are likely follow-ups or cleanups.
- C8 - Catches mass-attribution entries in the NVD where a CVE lists a large range of commits without any function-level specifier. If a CVE has more than 5 distinct commits and none of its functions appear in the description, the entire CVE is dropped. These entries typically come from commit-range listings, and there is no reliable signal about which functions are actually vulnerable.
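Two of the rules above, C3 and C8, are simple enough to sketch directly. The row fields `func`, `cve`, `commit`, and `mentioned` (whether the function name appears in the CVE description) and the function names are assumptions for illustration; the thresholds are the ones stated in the list.

```python
from collections import defaultdict
from typing import Dict, List


def apply_c3(rows: List[Dict], min_chars: int = 50) -> List[Dict]:
    """C3: drop functions shorter than `min_chars` characters."""
    return [r for r in rows if len(r["func"]) >= min_chars]


def apply_c8(rows: List[Dict], max_commits: int = 5) -> List[Dict]:
    """C8: drop CVEs with many commits and no mentioned function."""
    commits = defaultdict(set)
    mentioned = defaultdict(bool)
    for r in rows:
        commits[r["cve"]].add(r["commit"])
        mentioned[r["cve"]] |= bool(r["mentioned"])
    dropped = {
        cve
        for cve in commits
        if len(commits[cve]) > max_commits and not mentioned[cve]
    }
    return [r for r in rows if r["cve"] not in dropped]
```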
Rows are split into train/val/test by date. Positives use the earliest commit date across all rows for their CVE. This is the final step, and its output is the final dataset to be used!
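The date-based split for positives can be sketched as follows. The 80/10/10 proportions, the row fields `cve` and `commit_date`, and the function name are assumptions for illustration; the actual cut-offs are defined by the pipeline. The sketch covers only the rule stated above, that every row of a CVE is placed according to that CVE's earliest commit date.

```python
from datetime import date
from typing import Dict, List


def temporal_split(rows: List[Dict], train_pct: int = 80, val_pct: int = 10) -> Dict[str, List[Dict]]:
    """Split rows by the earliest commit date of their CVE."""
    earliest: Dict[str, date] = {}
    for row in rows:
        cve, day = row["cve"], row["commit_date"]
        if cve not in earliest or day < earliest[cve]:
            earliest[cve] = day

    # Oldest CVEs first; the CVE id breaks ties deterministically.
    ordered = sorted(earliest, key=lambda cve: (earliest[cve], cve))
    train_end = len(ordered) * train_pct // 100
    val_end = len(ordered) * (train_pct + val_pct) // 100

    bucket = {}
    for i, cve in enumerate(ordered):
        bucket[cve] = "train" if i < train_end else "val" if i < val_end else "test"

    splits: Dict[str, List[Dict]] = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[bucket[row["cve"]]].append(row)
    return splits
```

Assigning whole CVEs to a split, instead of individual rows, keeps functions from the same vulnerability out of both the training and the test set.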