ASSERT-KTH/AccurateVul

AccurateVul Dataset Pipeline

This pipeline assumes that the first three phases of MegaVul's pipeline have already been executed. Those phases collect the NVD, filter by language and CWE, and extract commits from 28 different git hosting platforms. See https://github.com/icyrockton/megavul for documentation, source code and usage. Note: this pipeline is only usable with access to the storage created by MegaVul. You MUST ensure that a directory containing MegaVul's storage structure exists before running this project!

How to Run

The pipeline comes with an all-in-one script that runs it end-to-end. The final product is a directory containing the entire dataset, temporally split into three separate files: training set, validation set and test set.

Create a virtual environment

python -m venv .venv

Activate the environment

Windows PowerShell:

./.venv/Scripts/activate.ps1

Unix:

source ./.venv/bin/activate

Install the required packages

pip install -r requirements.txt

Run the pipeline initializer

python run_pipeline.py

Finetuning

This section briefly explains how to reproduce the models described in the thesis.

StarCoder2-3b

StarCoder2-3b is a decoder-only model fine-tuned with LoRA via Unsloth. The model is loaded with 4-bit quantization, and LoRA adapters are applied to all attention and MLP projection layers (q/k/v/o/gate/up/down). Training uses SFT (supervised fine-tuning) with a structured prompt: a task instruction, the truncated source code, and the label suffix. The model learns to generate "VULN" or "BENIGN". At inference, the model generates up to 16 tokens and the label is parsed from the output with a regex. Default hyperparameters:

  • lr 2e-4
  • per-device batch size 2
  • gradient accumulation steps 8
  • effective batch size 16 (2 × 8)
  • 2 epochs
  • cosine scheduler
  • adamw_8bit optimizer
  • max sequence length 1024
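
The prompt layout and label parsing described above can be sketched as follows. This is a minimal illustration, not the template used in finetune.py: the instruction text, the "### Code:" / "### Label:" markers, the character-based truncation, and the names `build_prompt` and `parse_label` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical instruction text; the real wording lives in finetune.py.
INSTRUCTION = "Classify the following C/C++ function as VULN or BENIGN."
# Word boundaries keep "VULNERABLE" from being read as the label "VULN".
LABEL_PATTERN = re.compile(r"\b(VULN|BENIGN)\b")


def build_prompt(code: str, label: Optional[str] = None, max_chars: int = 4000) -> str:
    """Assemble instruction + truncated code + label suffix.

    Character truncation stands in for token-level truncation here.
    With label=None the prompt ends right before the label (inference).
    """
    truncated = code[:max_chars]
    prompt = f"{INSTRUCTION}\n\n### Code:\n{truncated}\n\n### Label:\n"
    return prompt + label if label is not None else prompt


def parse_label(generated: str) -> Optional[str]:
    """Return the first VULN/BENIGN token in the generated text, or None."""
    match = LABEL_PATTERN.search(generated)
    return match.group(1) if match else None
```

Returning None for unparseable generations keeps them distinguishable from a confident BENIGN; how the actual evaluation counts such cases is not specified here.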

Usage:

Train:

python finetune.py train --task binary --data balanced --adapter-dir ./adapters/balanced-binary-lora

Eval:

python finetune.py eval  --task binary --adapter-dir ./adapters/balanced-binary-lora --test-ratio 27

CodeBERT-125M

An encoder model with a single-logit classification head, trained with sigmoid BCE loss. The default is full fine-tuning (no LoRA, since the model is small), matching the PrimeVul paper setup. The raw source code is tokenized directly, with no prompt template; the model outputs a vulnerability probability via a sigmoid, thresholded at 0.5. Hyperparameters follow the PrimeVul experiments:

  • 10 epochs
  • lr 2e-5
  • batch 64
  • 1000 warm up steps
  • linear scheduler
  • max_grad_norm 1
  • optional class weighting and LoRA modes are available via flags but are not used in the default runs presented in the thesis
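
To make the single-logit setup concrete, here is a small sketch of sigmoid BCE on one raw logit and the 0.5 decision threshold. It uses only the standard library and is meant to illustrate the arithmetic, not to reproduce the training code; the function names are made up for this example.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability without overflowing for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_with_logit(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, in the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def predict(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) if the sigmoid probability reaches the threshold."""
    return int(sigmoid(logit) >= threshold)
```

Because the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is equivalent to checking whether the logit is non-negative.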

Docker

Even with Docker, not every machine can run this: this fine-tuning setup requires an NVIDIA GPU with CUDA 11.2 support. The image is strongly recommended because torch and CUDA come bundled in it, so neither has to be installed separately.

docker pull jonasdaderman/codebert-thesis:latest

Usage:

python finetune.py train --dataset primevul --task binary --data full --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul --epochs 10 --learning-rate 2e-5 --batch-size 64 --warmup-steps 1000
python finetune.py eval  --dataset primevul --task binary --base-model microsoft/codebert-base --adapter-dir ./adapters/codebert-primevul

Full Replication

To fully replicate the results, use the PowerShell scripts provided in the directory.

UNIX users can translate these to bash, for example with Claude or similar tools.

Training: ./run_all_finetunings.ps1

Evaluation: ./run_all_eval.ps1

Thesis: LINK TO DIVA PORTAL.

Documentation

Extract Functions

The first step of the pipeline extracts functions from the downloaded commits. It parses each file with Tree-sitter, identifies C/C++ function declarators in the AST, and extracts the function contents.

Recover Kernel

Running the MegaVul pipeline sends a large number of requests to git.kernel.org, which was blocked in the case of this thesis. To still obtain these vulnerabilities, the GitHub mirror of the Linux kernel was accessed instead. The thesis describes how this is done.
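
As a rough illustration of the idea, a git.kernel.org commit URL can be rewritten to a GitHub URL by pulling out the commit hash. This is only a sketch: the choice of the torvalds/linux mirror, the URL shapes handled, and the name `to_github_mirror` are assumptions, and the thesis may do this differently. Commits that exist only on stable branches may not resolve on a mainline mirror.

```python
import re
from typing import Optional

# Assumption: the mainline mirror; the mirror used in the thesis is not stated here.
GITHUB_MIRROR = "https://github.com/torvalds/linux/commit/"

# A hash given either as "?id=<sha>" or as the last path segment,
# followed by the end of the URL, another query parameter, or a fragment.
SHA_PATTERN = re.compile(r"(?:id=|/)([0-9a-f]{7,40})(?:$|[&#])")


def to_github_mirror(url: str) -> Optional[str]:
    """Rewrite a git.kernel.org commit URL to the GitHub mirror, or return None."""
    if "git.kernel.org" not in url:
        return None
    match = SHA_PATTERN.search(url)
    return GITHUB_MIRROR + match.group(1) if match else None
```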

Recover Partial Downloads

Some commits in the MegaVul commit cache have one side (fix or parent) but not the other, which caused many vulnerabilities to be silently dropped. This step audits the cache for asymmetric pairs, fetches the missing side, and re-extracts the functions for the affected commits.
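
The audit can be pictured as below. The in-memory representation (a mapping from commit hash to the set of sides present) and the name `find_asymmetric_pairs` are hypothetical; the real step reads the MegaVul cache on disk.

```python
from typing import Dict, List, Set, Tuple


def find_asymmetric_pairs(cache: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (commit, missing_side) for every commit with exactly one side cached.

    `cache` maps a fix-commit hash to the sides present: "fix" and/or "parent".
    Commits with both sides or with neither side are left alone.
    """
    missing = []
    for commit, sides in sorted(cache.items()):
        present = sides & {"fix", "parent"}
        if len(present) == 1:
            (have,) = present
            missing.append((commit, "parent" if have == "fix" else "fix"))
    return missing
```

Each returned pair names the side that still has to be fetched before the functions for that commit can be re-extracted.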

Merge Recovery

Folds the recovered rows from the kernel and partial-download steps back into the main set of vulnerable functions. Rows for commits touched by recovery are replaced, and the dataset is deduplicated using "hash_dedup_check.py".
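
A minimal sketch of the replace-then-deduplicate logic, under stated assumptions: rows are dicts with hypothetical "commit" and "func" keys, and duplicates are detected with a SHA-256 hash of the whitespace-stripped function text. What "hash_dedup_check.py" actually hashes is not described here.

```python
import hashlib
from typing import Dict, List

Row = Dict[str, str]


def function_hash(code: str) -> str:
    """Hash a function after removing all whitespace (assumed normalization)."""
    normalized = "".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_recovered(main_rows: List[Row], recovered_rows: List[Row]) -> List[Row]:
    """Replace rows of commits touched by recovery, then drop duplicate functions."""
    touched = {row["commit"] for row in recovered_rows}
    merged = [row for row in main_rows if row["commit"] not in touched]
    merged.extend(recovered_rows)

    seen = set()
    unique = []
    for row in merged:
        digest = function_hash(row["func"])
        if digest not in seen:  # keep the first occurrence only
            seen.add(digest)
            unique.append(row)
    return unique
```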

Classify Merged

This is the thesis's contribution in terms of data pipelining. The program evaluates each data sample and assigns each group to one of five labels along two axes: how many functions changed in the commit (single vs. multiple) and how many of those are mentioned by name in the CVE description (none, one, or multiple). A trivial-diff prefilter first strips rows where the only changes are whitespace or comments, so they don't inflate the function count.
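
The two axes and the prefilter can be sketched as follows. Since a single changed function cannot have multiple of its functions mentioned, the 2 × 3 grid leaves five reachable combinations. Only MULTI_FUNCTION_MULTI_MENTION is a label name confirmed by this README; the other names are extrapolated from it, and the comment-stripping regex is a naive assumption that ignores string literals.

```python
import re
from typing import List, Optional, Tuple

# Naive C/C++ comment matcher (does not understand string literals).
COMMENT_PATTERN = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def normalize(code: str) -> str:
    """Drop comments and all whitespace."""
    return "".join(COMMENT_PATTERN.sub("", code).split())


def is_trivial_diff(before: str, after: str) -> bool:
    """True when the two versions differ only in whitespace or comments."""
    return normalize(before) == normalize(after)


def classify_commit(changed: List[Tuple[str, str, str]], description: str) -> Optional[str]:
    """Label a commit from its (name, before, after) function triples.

    Returns None when nothing but trivial diffs remains.
    """
    names = [name for name, before, after in changed
             if not is_trivial_diff(before, after)]
    if not names:
        return None
    mentioned = [name for name in names
                 if re.search(rf"\b{re.escape(name)}\b", description)]
    functions = "SINGLE" if len(names) == 1 else "MULTI"
    mentions = "NO" if not mentioned else ("SINGLE" if len(mentioned) == 1 else "MULTI")
    return f"{functions}_FUNCTION_{mentions}_MENTION"
```

Running the prefilter before counting is what keeps a comment-only touch on a second function from turning a single-function commit into a multi-function one.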

Heuristics

The hardest category is MULTI_FUNCTION_MULTI_MENTION (MFMM): commits where multiple function names appear in the description and multiple functions are touched across commits. For these, five heuristics, described in the thesis, are applied to identify the truly vulnerable function. This step recovers a fraction of the category, roughly 160 rows.

Checks and Validations

To stay sane while working with the data, the pipeline runs several checks and validations. Some are redundant but validate previous steps (e.g. deduplication is performed twice). C1 and C2 are deduplication and whitespace stripping. C3-C8 are new checks that filter the data.

  • C3 - Drop functions shorter than 50 characters, which are too short to be useful.
  • C4 - Sanity check: remove any trivial diffs.
  • C5 - Remove test/example files, e.g. any unit or integration tests.
  • C6a - The NVD points several CVEs to the same commit, so the same vulnerable code is duplicated under different CVEs. Without this rule, one fix commit referenced by three CVEs would contribute three functions that look unique by CVE but already exist. E.g. CVE-2023-001 through 003 all reference the same commit. This rule collapses them into one.
  • C6b - Sometimes a single CVE is fixed across multiple commits that touch the same function. This rule keeps only the final version. E.g. CVE-2022-555 is on the main branch but also references a commit on the stable branch. Both are downloaded because they mention the code, but they are identical, so one commit is removed.
  • C7 - Targets CVEs that have multiple commits in the dataset. If some of those commits contain functions that are mentioned in the CVE description and others don't, the unmentioned commits are likely follow-ups or cleanups.
  • C8 - Catches mass-attribution entries in the NVD where a CVE lists a large range of commits without any function-level specifier. If a CVE has more than 5 distinct commits and none of its functions appear in the description, the entire CVE is dropped. These entries typically come from commit-range listings, and there is no reliable signal about which functions are actually vulnerable.
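
Three of the rules above (C3, C6a and C8) are simple enough to sketch directly. The row layout is hypothetical: each row is a dict with "cve", "commit", "name", "func" and a boolean "mentioned" flag saying whether the function name appears in the CVE description. The function names and the keep-first policy in C6a are likewise assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_c3(rows: List[Row], min_chars: int = 50) -> List[Row]:
    """C3: drop functions shorter than min_chars characters."""
    return [row for row in rows if len(row["func"]) >= min_chars]


def apply_c6a(rows: List[Row]) -> List[Row]:
    """C6a: keep one row per (commit, function) even if several CVEs cite it."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["commit"], row["name"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def apply_c8(rows: List[Row], max_commits: int = 5) -> List[Row]:
    """C8: drop CVEs with more than max_commits commits and no mentioned function."""
    commits = defaultdict(set)
    any_mentioned = defaultdict(bool)
    for row in rows:
        commits[row["cve"]].add(row["commit"])
        any_mentioned[row["cve"]] = any_mentioned[row["cve"]] or bool(row["mentioned"])
    return [row for row in rows
            if len(commits[row["cve"]]) <= max_commits or any_mentioned[row["cve"]]]
```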

Temporal Splits

Rows are split into train/val/test by date. Positives use the earliest commit date across all rows for their CVE. This is the final step, and its output is the final dataset to use.
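
A sketch of the date-based assignment, with several assumptions labeled: the cutoff dates are passed in by the caller (the actual cutoffs are not given here), rows are dicts with hypothetical "cve", "label" and "date" keys, and negatives are placed by their own commit date, which this README does not specify.

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def temporal_split(rows: List[Row], val_start: date, test_start: date) -> Dict[str, List[Row]]:
    """Split rows into train/val/test by date.

    Positives (label == 1) use the earliest date among all positive rows of
    their CVE, so a CVE never straddles two splits. Negatives use their own
    date (an assumption for this sketch).
    """
    earliest: Dict[str, date] = {}
    for row in rows:
        if row["label"] == 1:
            cve = row["cve"]
            earliest[cve] = min(earliest.get(cve, row["date"]), row["date"])

    splits: Dict[str, List[Row]] = {"train": [], "val": [], "test": []}
    for row in rows:
        when = earliest[row["cve"]] if row["label"] == 1 else row["date"]
        if when < val_start:
            splits["train"].append(row)
        elif when < test_start:
            splits["val"].append(row)
        else:
            splits["test"].append(row)
    return splits
```

Keying positives on the CVE's earliest date means a late backport of an old fix lands in the same split as the original fix, which avoids leaking near-identical code from training into the test set.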
