AccurateVul Dataset Pipeline

This pipeline assumes that the first three phases of MegaVul's pipeline have already been executed: collecting the NVD, filtering by language and CWE, and extracting commits from 28 different git hosting platforms. See https://github.com/icyrockton/megavul for documentation, source code and usage. Note: this pipeline is only usable with access to the storage created by MegaVul. You MUST ensure that a directory containing MegaVul's storage structure exists before running this project!

How to Run

The pipeline comes with an all-in-one script that runs it end-to-end. The final product is a directory containing the entire dataset, temporally split into three separate files: training set, validation set and test set.

Create a virtual environment

python -m venv .venv

Activate the environment

Windows PowerShell:

./.venv/Scripts/activate.ps1

Unix:

source .venv/bin/activate

Install the required packages

pip install -r requirements.txt

Run the pipeline initializer

python run_pipeline.py

Finetuning

This section briefly explains how to reproduce the models described in the thesis.

StarCoder2-3b

StarCoder2-3b is a decoder-only model fine-tuned with LoRA via Unsloth. The model is loaded with 4-bit quantization, and LoRA adapters are applied to all attention and MLP projection layers (q/k/v/o/gate/up/down). Training uses SFT (Supervised Fine-Tuning) with a structured prompt: a task instruction, the truncated source code, and the label suffix. The model learns to generate "VULN" or "BENIGN". At inference, the model generates up to 16 tokens and the output is parsed with a regex for the label. Default hyperparameters:

learning rate 2e-4
per-device batch size 2
gradient accumulation steps 8
effective batch size 16 (2 x 8)
2 epochs
cosine scheduler
adamw_8bit optimizer
max sequence length 1024
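
The prompt layout and the label parsing described above can be sketched as follows. This is a minimal illustration, not the code in finetune.py: the instruction text, the section markers, the character-based truncation (the real pipeline most likely truncates by tokens) and the names build_prompt and parse_label are all assumptions.

```python
import re
from typing import Optional

# Hypothetical instruction text; the real prompt lives in finetune.py.
INSTRUCTION = "Classify the following C/C++ function as VULN or BENIGN."
LABEL_PATTERN = re.compile(r"\b(VULN|BENIGN)\b")


def build_prompt(code: str, max_chars: int = 2000, label: Optional[str] = None) -> str:
    """Assemble task instruction + truncated source code + label suffix.

    During training the gold label is appended; at inference it is left
    open so the model generates it.
    """
    prompt = f"{INSTRUCTION}\n\n### Code:\n{code[:max_chars]}\n\n### Label: "
    return prompt + label if label is not None else prompt


def parse_label(generated: str) -> Optional[str]:
    """Return the first VULN/BENIGN token in the generated text, or None."""
    match = LABEL_PATTERN.search(generated)
    return match.group(1) if match else None
```

Returning None for unparseable generations keeps them visible, so they can be counted separately instead of being silently mapped to one class.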

Usage:

Train:

python finetune.py train --task binary --data balanced --adapter-dir ./adapters/balanced-binary-lora

Eval:

python finetune.py eval --task binary --adapter-dir ./adapters/balanced-binary-lora --test-ratio 27

CodeBERT-125M

CodeBERT is an encoder model with a single-logit classification head, trained with sigmoid BCE loss. The default is full fine-tuning (no LoRA, since the model is small), matching the PrimeVul paper setup. The raw source code is tokenized directly, with no prompt template. The model outputs a vulnerability probability through a sigmoid, thresholded at 0.5. Hyperparameters follow the PrimeVul experiments:

10 epochs
learning rate 2e-5
batch size 64
1000 warmup steps
linear scheduler
max_grad_norm 1

Optional class weighting and LoRA modes are available via flags but are not used in the default runs presented in the thesis.
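
The single-logit setup above can be illustrated without any deep learning library. The sketch below uses only the standard library; the function names are hypothetical and the actual training code in finetune.py presumably relies on a framework loss instead. It shows the numerically stable form of binary cross-entropy on a raw logit and the 0.5 threshold on the sigmoid output.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logits(logit: float, label: int) -> float:
    """Binary cross-entropy on a raw logit x with label y in {0, 1}.

    Uses the stable identity max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def predict(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) if the sigmoid probability exceeds the threshold."""
    return int(sigmoid(logit) > threshold)
```

With a threshold of 0.5 the decision reduces to the sign of the logit, which is why a single output unit is enough for the binary task.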

Docker

Docker does not remove the hardware requirement: this fine-tuning needs an NVIDIA GPU that supports CUDA 11.2. The image is strongly recommended because torch and CUDA come preinstalled on it, so neither has to be installed separately.

docker pull jonasdaderman/codebert-thesis:latest

Usage:

python finetune.py train --dataset primevul --task binary --data full --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul --epochs 10 --learning-rate 2e-5 --batch-size 64 --warmup-steps 1000

python finetune.py eval --dataset primevul --task binary --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul

Full Replication

To fully replicate the material, use the PowerShell scripts provided in the directory.

Unix users can translate these to bash, for example with Claude or similar tools.

Training: ./run_all_finetunings.ps1
Evaluation: ./run_all_eval.ps1

Thesis: LINK TO DIVA PORTAL.

Documentation

Extract Functions

The first step of this pipeline extracts the functions from the downloaded commits. It parses each file with tree-sitter, identifies C/C++ function declarators in the resulting syntax tree, and extracts the function contents.

Recover Kernel

As one may have noticed when trying to execute the MegaVul pipeline, it requests a large number of resources from git.kernel.org, which was blocked in the case of this thesis. To obtain these vulnerabilities, the GitHub mirror of the Linux kernel was accessed instead. The thesis describes how this is done.
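
As a rough illustration of the idea, a git.kernel.org commit link can be rewritten to point at a GitHub mirror. This is only a sketch under assumptions: it assumes the torvalds/linux repository on GitHub as the mirror and the cgit-style "?id=<sha>" query parameter, and the name kernel_to_github is hypothetical. The actual procedure is the one described in the thesis, and commits that exist only in other kernel trees may not resolve on this mirror.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: torvalds/linux on GitHub is used as the mirror.
GITHUB_MIRROR = "https://github.com/torvalds/linux/commit/"
SHA_PATTERN = re.compile(r"^[0-9a-f]{7,40}$")


def kernel_to_github(url: str) -> Optional[str]:
    """Map a git.kernel.org commit URL to the mirror URL, or return None."""
    parsed = urlparse(url)
    if parsed.hostname != "git.kernel.org":
        return None
    # cgit commit links carry the hash in the "id" query parameter.
    for sha in parse_qs(parsed.query).get("id", []):
        sha = sha.lower()
        if SHA_PATTERN.match(sha):
            return GITHUB_MIRROR + sha
    return None
```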

Recover Partial Downloads

Some commits in the MegaVul commit cache have one side (fix or parent) but not the other. This caused many vulnerabilities to be silently dropped. To fix this, this step audits the cache for asymmetric pairs, fetches the missing side, and re-extracts the functions for the affected commits.
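
The audit for asymmetric pairs can be sketched as below. The in-memory layout (a mapping from fix-commit hash to the set of cached sides) and the name find_missing_sides are assumptions for illustration; the real step reads the MegaVul commit cache on disk.

```python
from typing import Dict, List, Set


def find_missing_sides(cached: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, per commit, which side ("fix" or "parent") still has to be fetched.

    `cached` maps a fix-commit hash to the set of sides present in the cache.
    Commits with both sides, or with neither side, are not reported.
    """
    expected = {"fix", "parent"}
    missing: Dict[str, List[str]] = {}
    for commit, sides in cached.items():
        present = sides & expected
        if len(present) == 1:  # exactly one side cached: asymmetric pair
            missing[commit] = sorted(expected - present)
    return missing
```

The returned mapping doubles as a work list: each entry names the commit and the single side that has to be downloaded before its functions can be re-extracted.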

Merge Recovery

Folds the recovered rows from the kernel and partial-download recovery back into the main set of vulnerable functions. Replaces the rows of every commit touched by recovery and deduplicates the dataset using "hash_dedup_check.py".
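
A minimal sketch of the replace-then-deduplicate logic is shown here. The row keys ("commit", "func"), the whitespace-collapsing normalization and the use of SHA-256 are assumptions; the hashing actually performed by hash_dedup_check.py may differ.

```python
import hashlib
from typing import Dict, List

Row = Dict[str, str]  # hypothetical keys: "commit", "func"


def body_hash(row: Row) -> str:
    """Hash the function body after collapsing runs of whitespace."""
    normalized = " ".join(row["func"].split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_recovery(main: List[Row], recovered: List[Row]) -> List[Row]:
    """Replace rows of commits touched by recovery, then drop duplicate bodies."""
    touched = {row["commit"] for row in recovered}
    merged = [row for row in main if row["commit"] not in touched] + recovered
    seen = set()
    unique: List[Row] = []
    for row in merged:
        digest = body_hash(row)
        if digest not in seen:  # keep the first occurrence of each body
            seen.add(digest)
            unique.append(row)
    return unique
```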

Classify Merged

This is the contribution of the thesis in terms of data pipelining. The program evaluates each data sample and assigns each group to one of five labels on two axes: how many functions changed in the commit (single vs. multiple) and how many of those are mentioned by name in the CVE description (none, one or multiple). A trivial-diff prefilter strips rows where the only changes are whitespace or comments before grouping, so they don't inflate the function count.
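
The two axes and the trivial-diff prefilter can be sketched as follows. This is an illustration under assumptions: the comment stripping is a crude regex that ignores string literals, removing all whitespace can in rare cases hide a real token change, the function names are hypothetical, and the mapping from the two axes to the five actual label names is not shown because only the thesis and 5_classify_merged.py define it.

```python
import re
from typing import List, Tuple

COMMENT_PATTERN = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def normalize(code: str) -> str:
    """Strip C/C++ comments and all whitespace (crude: ignores string literals)."""
    return re.sub(r"\s+", "", COMMENT_PATTERN.sub("", code))


def is_trivial_diff(before: str, after: str) -> bool:
    """True when the two versions differ only in whitespace or comments."""
    return normalize(before) == normalize(after)


def classify_axes(changed: List[Tuple[str, str, str]], description: str) -> Tuple[str, str]:
    """Place one commit on the (function count, mention count) axes.

    `changed` holds (function_name, before, after) triples for one commit.
    """
    # Prefilter: trivial diffs must not inflate the function count.
    real = [name for name, before, after in changed if not is_trivial_diff(before, after)]
    if not real:
        raise ValueError("commit contains only trivial diffs")
    mentioned = [n for n in real if re.search(rf"\b{re.escape(n)}\b", description)]
    func_axis = "single" if len(real) == 1 else "multi"
    mention_axis = "none" if not mentioned else "one" if len(mentioned) == 1 else "multi"
    return func_axis, mention_axis
```

For example, a commit that changes parse_hdr and only adds a comment to helper counts as a single-function commit once the prefilter has run.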

Heuristics

For the hardest category, MULTI_FUNCTION_MULTI_MENTION (MFMM), i.e. commits where multiple function names appear in the description and multiple functions are touched across commits, heuristics are applied to identify the truly vulnerable function. Five heuristics are applied; they are described in the thesis. This step recovers a fraction of the category, roughly 160 rows.

Checks and Validations

To maintain sanity while working with the data, several checks and validations are applied. Some of them are redundant but validate previous steps (e.g. deduplication is performed twice). C1 and C2 are deduplication and whitespace stripping. C3-C8 are new checks that filter the data.

C3 - Drop functions shorter than 50 characters; they are too short to be useful.
C4 - Sanity check: remove any trivial diffs.
C5 - Remove test/example files, e.g. any unit or integration tests.
C6a - The NVD points several CVEs to the same commit, so the same vulnerable piece of code is duplicated under different CVEs. Without this rule, one fix commit mentioned by three CVEs would contribute three functions that look unique by CVE but already exist. E.g. CVE-2023-001 through 003 all reference the same commit. This rule collapses them into one.
C6b - Sometimes a single CVE is fixed across multiple commits that touch the same function. This rule keeps only the final version. E.g. CVE-2022-555 is fixed on the main branch but also references a commit on the stable branch. Both are downloaded, but the code is identical, so one commit is removed.
C7 - Targets CVEs that have multiple commits in the dataset. If some of those commits contain functions that are mentioned in the CVE description and others don't, the unmentioned commits are likely follow-ups or cleanups.
C8 - Catches mass-attribution entries in the NVD where a CVE lists a large range of commits without any function-level specifier. If a CVE has more than 5 distinct commits and none of its functions appear in the description, the entire CVE is dropped. These entries typically come from commit-range listings, and there is no reliable signal about which functions are actually vulnerable.
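
Three of the rules above (C3, C6a and C8) can be sketched on plain dictionaries. The row keys, the (commit, function name) identity used for C6a, the substring match used for mentions in C8 and all function names are assumptions made for illustration; the actual rules live in the pipeline scripts.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]  # hypothetical keys: cve, commit, name, func, description


def c3_min_length(rows: List[Row], min_chars: int = 50) -> List[Row]:
    """C3: drop function bodies shorter than `min_chars` characters."""
    return [r for r in rows if len(r["func"]) >= min_chars]


def c6a_collapse_shared_commits(rows: List[Row]) -> List[Row]:
    """C6a: keep one row per (commit, function) even if several CVEs cite it."""
    seen = set()
    kept: List[Row] = []
    for r in rows:
        key = (r["commit"], r["name"])
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


def c8_drop_mass_attribution(rows: List[Row], max_commits: int = 5) -> List[Row]:
    """C8: drop CVEs with more than `max_commits` commits and no mentioned function."""
    by_cve = defaultdict(list)
    for r in rows:
        by_cve[r["cve"]].append(r)
    dropped = set()
    for cve, group in by_cve.items():
        commits = {r["commit"] for r in group}
        mentioned = any(r["name"] in r["description"] for r in group)
        if len(commits) > max_commits and not mentioned:
            dropped.add(cve)
    return [r for r in rows if r["cve"] not in dropped]
```

Each rule is a pure function from rows to rows, so the checks can be chained in any order and the number of rows removed by each one can be logged separately.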

Temporal Splits

Rows are split into train/val/test by date. Positives use the earliest commit date across all rows of their CVE. This is the final step, and it outputs the final dataset to be used.
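
The date-based split can be sketched as below. The cutoff dates passed in, the row keys, and the treatment of rows without a CVE (they use their own date) are assumptions; the actual cutoffs are defined in 12_temporal_split.py.

```python
from datetime import date
from typing import Dict, List, Optional


def temporal_split(rows: List[dict], train_end: date, val_end: date) -> Dict[str, List[dict]]:
    """Split rows into train/val/test by effective date.

    Each row is a dict with hypothetical keys "cve" (str or None) and
    "date" (datetime.date). Rows with a CVE use the earliest commit date
    of that CVE, so one CVE never ends up in two different splits.
    """
    earliest: Dict[str, date] = {}
    for r in rows:
        cve: Optional[str] = r["cve"]
        if cve is not None:
            earliest[cve] = min(earliest.get(cve, r["date"]), r["date"])

    splits: Dict[str, List[dict]] = {"train": [], "val": [], "test": []}
    for r in rows:
        d = earliest[r["cve"]] if r["cve"] is not None else r["date"]
        name = "train" if d < train_end else "val" if d < val_end else "test"
        splits[name].append(r)
    return splits
```

Assigning by cutoff date rather than by row index keeps all rows of a CVE together, which avoids leaking a vulnerability from the training set into the test set.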
