
🐲🔨 exLong: Generating Exceptional Behavior Tests with Large Language Models

exLong is a large language model, instruction-tuned from CodeLlama, that embeds reasoning about:

  • traces that lead to throw statements
  • conditional expressions that guard throw statements
  • non-exceptional behavior tests (nEBTs) that execute similar traces

About

This repo hosts the code and data for the following ICSE 2025 paper:

Title: exLong: Generating Exceptional Behavior Tests with Large Language Models

Authors: Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

@inproceedings{ZhangETAL25exLong,
  author = {Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {exLong: Generating Exceptional Behavior Tests with Large Language Models},
  booktitle = {International Conference on Software Engineering},
  year = {2025},
}

We also include the implementation of the CLI tool described in our FSE 2025 Demo Paper:

Title: A Tool for Generating Exceptional Behavior Tests With Large Language Models

Authors: Linghan Zhong, Samuel Yuan, Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

@inproceedings{ZhongETAL25exLongTool,
  author = {Zhong, Linghan and Yuan, Samuel and Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {A Tool for Generating Exceptional Behavior Tests With Large Language Models},
  booktitle = {ACM International Conference on the Foundations of Software Engineering Demonstrations},
  year = {2025},
}

Table of Contents

  1. Quick Start 🤗
  2. Set Up 🚀
  3. Experiments 👷
  4. Artifacts
  5. CLI 💻

Quick Start

  • The exLong dataset is on Hugging Face 🤗!
from datasets import load_dataset

with_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")
no_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "no-EBT-name")
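
# Quick peek at the loaded data (a sketch; assumes a "train" split exists)
print(with_name_ds)
print(with_name_ds["train"][0])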
  • The exLong model is on Hugging Face 🤗!
pip install transformers accelerate bitsandbytes peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

# Load the base model
base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Load the LoRA configuration
peft_model_id = "EngineeringSoftware/exLong"
config = PeftConfig.from_pretrained(peft_model_id, revision="with-etest-name")  # set revision to "no-etest-name" for no EBT name

# Load the LoRA model (pass the same revision as the config)
model = PeftModel.from_pretrained(base_model, peft_model_id, revision="with-etest-name")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

prompt = """<s>[INST] <<SYS>>
You are a helpful programming assistant and an expert Java programmer. You are helping a user writing exceptional-behavior tests for their Java code.
<</SYS>>

Please complete an exceptional behavior test method in Java to test the method 'factorial' for the exception 'IllegalArgumentException'.
The method to be tested is defined as:
```java
public static long factorial(int n) {
    if (n < 0) {
        throw new IllegalArgumentException("Number must be non-negative.");
    }
    long result = 1;
    for (int i = 1; i <= n; i++) {
        result *= i;
    }
    return result;
}
```
Please only give the new exceptional-behavior test method to complete the following test class. Do NOT use extra libraries or define new helper methods. Return **only** the code in the completion:
```java
public class FactorialTest {
}
```
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate code
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    temperature=0.2,      # Sampling temperature (lower is more deterministic)
    top_p=0.95,           # Top-p (nucleus) sampling
    do_sample=True        # Enable sampling
)

# Decode and print the generated code
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Code:")
print(generated_code)
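
# Optionally decode only the newly generated tokens: slicing off the prompt
# tokens leaves just the model's completion
completion = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
print("Completion only:")
print(completion)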

Set Up

Dependencies Set Up

  1. Create conda environment
conda create -n exlong python=3.9
conda activate exlong
pip install -r requirements.txt
  2. We used axolotl to fine-tune the CodeLlama model. If you want to train your own model, install the extra dependencies:
# we used an older version of axolotl to train the models
git clone git@github.com:JiyangZhang/axolotl-exlong.git
cd axolotl-exlong/
conda activate exlong
pip install packaging
# set CUDA_HOME
export CUDA_HOME=/opt/apps/cuda/12.0/
pip3 install -e '.[flash-attn,deepspeed]'
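# optional sanity check that PyTorch can see the GPU before training
python -c "import torch; print(torch.cuda.is_available())"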

Experiments Set Up

  1. Download raw dataset
mkdir -p _work/data/
mkdir -p _work/exp/
mkdir -p _work/setup/

wget -L https://utexas.box.com/shared/static/hfcp4za3j9vp8lh5u8iviadixuxu8080.gz -O raw-data.tar.gz
tar -xzf raw-data.tar.gz -C _work/data/
mv _work/data/etestgen-raw-data-12k _work/data/ne2e

wget -L https://utexas.box.com/shared/static/4m7mntp0ix18dkl1ikkspcmpuvybfs1f.gz -O ne2e-test.tar.gz
tar -xzf ne2e-test.tar.gz -C _work/data/

wget -L https://utexas.box.com/shared/static/y4e52k5x8vk8vcr59lg33gebcg2m1caw.gz -O rq2.tar.gz
tar -xzf rq2.tar.gz -C _work/data/

# netest-diversity
wget -L https://utexas.box.com/shared/static/j417e93j1rdvdqz2yobttygfhucfbkjm.gz -O netest-diversity.tar.gz
tar -xzf netest-diversity.tar.gz -C _work/data/

You should see _work/data/ne2e, _work/data/rq1-eval, _work/data/rq2 and _work/data/netest-diversity.

  2. Prepare the datasets and put them in the _work/setup directory
  • exLong & exLong sample (Tables IV & V)
# exlong
inv -e data.setup-model-data --setup-name conditionnestack2e-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-no-name-ft

# exlong sample
inv -e data.setup-model-data --setup-name conditionnestack2e-all-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-all-no-name-ft

You should see _work/setup/conditionnestack2e-with-name-ft/, _work/setup/conditionnestack2e-no-name-ft/, _work/setup/conditionnestack2e-all-with-name-ft/, _work/setup/conditionnestack2e-all-no-name-ft/ directories.

  3. Construct prompts for exLong developer-view
  • exLong
inv -e data.process-codellama-data --setup-name conditionnestack2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionnestack2e-no-name-ft
  4. Construct prompts for exLong machine-view
mkdir _work/setup/conditionnestack2e-all-no-name-ft/eval/ -p
cp -r _work/data/rq2/ _work/setup/conditionnestack2e-all-no-name-ft/eval/
python -m etestgen.codellama.realDataProcessor --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml process_test_data

You will see _work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl.
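
To sanity-check the constructed prompts, you can inspect the first record of the JSONL file (a minimal sketch; the exact field names depend on the data processor):

import json

path = "_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl"
with open(path) as f:
    record = json.loads(f.readline())
print(record.keys())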

Experiments

Training

  1. Training exLong with EBT name

Note: conditionnestack2e is the setup name for exLong

cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-with-name-7b.yaml

You will see checkpoints in directory _work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/

  2. Training exLong without EBT name
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-no-name-7b.yaml
# script to run on TACC
sbatch axolotl-lora-codellama-7b-conditionnestack2e-no-name.sh

You will see checkpoints in directory _work/exp/conditionnestack2e-no-name-ft/lora-codellama-7b/

  3. Running exLong inference for developer-view
cd python/
# Run evaluation on the selected 434 examples in the test set
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-conditionnestack2e-with-name-ft.yaml run_gen --split real-test

You will see the model outputs in _work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/real-test-set-model-outputs.jsonl

  4. Running exLong inference for machine-view
cd python/
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml run_gen
# Evaluation 1: all covered projects
python -m etestgen.llm.eval --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_runtime_metrics
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-runtime-metrics.json`
# Evaluation 2: intersection projects
python -m etestgen.llm.eval --eval_set rq2 --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_subset_llm_results --subset_id_file ../results/tool-results/intersect-ids.json
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-intersect-runtime-metrics.json`

You will see the model generations in _work/exp/conditionnestack2e-all-no-name-ft/lora-codellama-7b/rq2-model-outputs.jsonl

Evaluation: compute metrics

Given the test cases generated by exLong, this step evaluates them with metrics such as BLEU, CodeBLEU, and test coverage. A toy similarity-metric example follows the commands below.

  • Input/Output info

    • The dataset used for evaluation is expected at _work/{test_data}
    • Processed LLM predictions are expected at _work/exp/{setup}/{model_name}/test-results
    • Similarity-metric results will be written to _work/exp/{setup}/{model_name}/test-out/similarity_metrics_summary.json and results/model-results/{setup}-{exp}-{eval_set}-sim-metrics.json
    • Runtime-metric results will be written to results/model-results/{setup}-{exp}-{eval_set}-runtime-metrics.json, with per-example results at _work/exp/{setup}/{model_name}/test-results/metrics.jsonl
  • To run evaluation on similarity metrics

    • Run on an individual experiment

      python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_llm_sim
  • To run evaluation on runtime metrics

    • Run on an individual experiment

      python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_runtime_metrics
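
For intuition, here is a toy similarity computation using sacrebleu, comparing a generated assertion against a gold one (illustrative only; the actual metric suite is implemented in etestgen.llm.eval):

import sacrebleu

generated = "assertThrows(IllegalArgumentException.class, () -> factorial(-1));"
gold = "assertThrows(IllegalArgumentException.class, () -> Factorial.factorial(-1));"

# sentence_bleu takes one hypothesis string and a list of reference strings
print(sacrebleu.sentence_bleu(generated, [gold]).score)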

Ablations on exLong's context

Diversity of the nEBTs

  1. Prepare Dataset
mkdir -p _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
mkdir -p _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cd python/
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml process_real_test_data
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml process_real_test_data

You will see processed data in _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/ and _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/.

  2. Running Inference
# [2nd row] use the same exLong ckpt but try prompting with the same nEBT multiple times
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml  run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
# [3rd row] use the same exLong ckpt but try prompting with different nEBTs
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml  run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/

You will see model outputs in directory _work/exp/diversity-conditionnestack2e-sample-with-name-ft/lora-codellama-7b/ and _work/exp/diversity-conditionnestack2e-all-with-name-ft/lora-codellama-7b/

exLong w/o stack trace

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name conditionne2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionne2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file python/configs/eval/codellama-7b-conditionne2e-with-name-ft.yaml run_gen

exLong w/o stack trace & guard expression

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name ne2e-with-name-ft
inv -e data.process-codellama-data --setup-name ne2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-ne2e-with-name-ft.yaml run_gen

exLong w/o stack trace & guard expression & nEBT

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name mut2e-with-name-ft
inv -e data.process-codellama-data --setup-name mut2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-mut2e-with-name-ft.yaml run_gen

Artifacts

Model Checkpoints:

Dataset:

  • repos.tar.gz: The list of repositories from which we collected the dataset.
  • raw-data.tar.gz: The raw data collected from the open-source repositories (extracts to etestgen-raw-data-12k/).
  • ne2e-test.tar.gz: The dataset for the developer-view evaluation (extracts to rq1-eval/).
  • machine-view.tar.gz: The dataset for the machine-view evaluation (extracts to rq2/).
  • netest-diversity.tar.gz: The dataset used to study how different nEBTs affect the model's performance (Table VII) (extracts to netest-diversity/).
  • processed dataset: The processed dataset (prompts) used to train the exLong models.

CLI

User-View Use Case

In the user view, exLong generates an exceptional behavior test (EBT) for a user-specified target throw statement. Use the following command:

python -m etestgen.cli user_view [OPTIONS]

Required Parameters

  • --repo_path: Local path or remote link to the git repository
  • --mut_file_path: Path to the file containing the method under test (MUT)
  • --mut_line: Line number of the beginning of the MUT's definition
  • --throw_file_path: Path to the file containing the target throw statement
  • --throw_line: Line number of the target throw statement
  • --test_context_path: Path to the test file

Optional Parameters

  • --sha: Commit SHA (default: latest commit on the main branch)
  • --test_name: Name of the test method to be generated (default: none)
  • --quant: Whether to use a quantized LLM (default: true); see the sketch after this list
  • --pick_best: Whether to sample multiple candidate EBTs and select the best test based on runtime evaluation (default: false)
  • --output_file: Output file path for the generated EBT; if not given, the EBT is added to the test file at test_context_path
  • --regenerate_data: Whether to collect context such as the stack trace and guard condition (default: true)
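
For reference, loading a 4-bit quantized model with transformers and bitsandbytes looks roughly like the sketch below (an assumption about what --quant does internally, not the CLI's exact code):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load the base model in 4-bit to reduce GPU memory (hypothetical equivalent of --quant)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)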

Example

python -m etestgen.cli user_view \
    --repo_path=./Wisp \
    --mut_file_path=Scheduler.java \
    --mut_line=180 \
    --quant=true \
    --throw_file_path=Scheduler.java \
    --throw_line=340 \
    --test_context_path=SchedulerTest.java \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --test_name=testSchedulerError \
    --pick_best=True \
    --output_file=./ExlongTest.java

This command will generate an exceptional behavior test for the throw statement at line 340 in Scheduler.java, targeting the method that begins at line 180, and output the result to ExlongTest.java.

Machine View

In the machine view, exLong generates EBTs for the entire codebase, aiming to cover all throw statements automatically. Use the following command:

python -m etestgen.cli machine_view [OPTIONS]

Required Parameters

  • --repo_path or --repo_link: Local path or remote link to the git repository
  • --test_context_path: Path to the test file

Optional Parameters

  • --sha: Commit SHA (default: latest commit on the main branch)
  • --pick_best: Whether to sample multiple candidate EBTs for each throw statement and select the best test based on runtime evaluation (default: false)
  • --quant: Whether to use a quantized LLM (default: true)
  • --timeout: Time budget in seconds for the tool to finish processing (default: infinity)
  • --output_file: Output file path for the generated EBTs; if not given, the EBTs are added to the test file at test_context_path
  • --regenerate_data: Whether to collect context such as the stack trace and guard condition (default: true)

Example

python -m etestgen.cli machine_view \
    --repo_link="https://github.com/Coreoz/Wisp.git" \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --timeout=1000

This command will analyze the entire Wisp repository at the specified commit and generate EBTs for all throw statements within the given time budget of 1000 seconds.
