PDF: will be linked later
Retrieval-augmented generation (RAG) is an umbrella of different components, design decisions, and domain-specific adaptations that enhance the capabilities of large language models and counter their limitations regarding hallucination and outdated or missing knowledge. Since it is unclear which design decisions lead to satisfactory performance, developing RAG systems is often experimental, and a systematic methodology is needed to obtain sound and reliable results. However, despite growing interest in this technology, there is currently no generally accepted methodology for RAG evaluation.
In this paper, we propose a first blueprint of a methodology for a sound and reliable evaluation of RAG systems and demonstrate its applicability on a real-world software engineering research task: the validation of configuration dependencies across software technologies.
In summary, we make two novel contributions: (i) a novel, reusable methodological design for evaluating RAG systems, including a demonstration that serves as a guideline, and (ii) a RAG system developed following this methodology that achieves the highest accuracy in the field of dependency validation. The key insights from the blueprint's demonstration are the crucial role of choosing appropriate baselines and metrics, the necessity of systematic RAG refinements derived from qualitative failure analysis, and the importance of reporting key design decisions to foster replication and evaluation.
- `/data`: contains the data of the subject systems, the ingested data, and the validation results
- `/evaluation`: contains the scripts for the evaluation
- `/src`: contains the implementation of the RAG system
We present the different RAG variants and their configurations used in our study.
ID | Embedding Model | Embedding Dimension | Reranking | Top N |
---|---|---|---|---|
1 | text-embedding-ada-002 | 1536 | ColBERT | 5 |
2 | gte-Qwen2-7B-instruct | 3584 | ColBERT | 5 |
3 | gte-Qwen2-7B-instruct | 3584 | Sentence Transformer | 5 |
4 | gte-Qwen2-7B-instruct | 3584 | ColBERT | 3 |
We present the failure categories along with a brief description, the involved technologies, and the actionable steps that can be taken to reduce the number of failures in these categories.
Failure Cat. | Description | Technologies | Actionable |
---|---|---|---|
Inheritance and Overrides | Maven introduces project inheritance, allowing modules to inherit configuration from a parent module, such as dependencies, plugins, properties, and build settings. For instance, while the `groupId` of a project is generally inherited and does not constitute a dependency if set explicitly, this does not hold if one module depends on another: in that case, the `groupId` has to be set explicitly and must be the same. | Maven | Provide project-specific information on project structure and module organization |
Configuration Consistency | Often values are the same across configuration files to ensure consistency. In this failure category, LLMs mistake values that are equal merely for the sake of consistency for actual dependencies. | Docker-Compose, Maven, Node.js, Spring Boot | Specialize prompt to distinguish consistency and dependency |
Resource Sharing | Sometimes resources, such as databases or services, can be shared across modules or used exclusively by a single module. Without additional project-specific information on how resources belong to modules, LLMs struggle to identify these dependencies. | Docker-Compose, Spring | Provide project-specific information on available resources |
Port Mapping | The ports of a service (e.g., a web server) are typically defined in several configuration files of different technologies, such as `application.yml`, `Dockerfile`, and `docker-compose.yml`. However, not all port mappings have to be equal (e.g., a container and host port in Docker Compose). | Docker, Docker-Compose, Spring | Provide examples for port mapping dependencies and non-dependencies |
Ambiguous Option Names | Software projects often use ambiguous naming schemes for configuration options and their values. These ambiguities result from generic and commonly used names (e.g., project name) that may not cause configuration errors if not consistent but can easily lead to misinterpretation by LLMs. | Docker-Compose, Maven, Spring | Specialize prompt to create awareness of naming conventions |
Context (Availability, Retrieval, and Utilization) | Failures in this category are either because relevant information is missing (e.g., not in the vector database or generally not available to vanilla LLMs), available in the database but not retrieved, or given to the LLM but not utilized to draw the right conclusion. | Docker-Compose, Maven | Add context, improve sources, or improve retrieval and prompting |
Independent Technologies and Services | In some cases (e.g., in containerized projects), different components are isolated by design. Thus, in these cases, the configuration options between these components are independent if not explicitly specified. | Docker, Docker-Compose | Provide examples of dependent and independent cases |
Others | This category contains all cases in which the LLMs fail to classify the dependencies correctly, that cannot be matched to any other category, and that share no common structure. | Docker, Docker-Compose, Spring, Maven, Node.js, TSconfig | Provide similar examples if possible |
The RAG system consists of three pipelines that have to be executed one after the other: the ingestion, retrieval, and generation pipeline. Before you run the retrieval and generation pipelines, you must first set up the vector database by running the ingestion pipeline. You can then run the retrieval pipeline to retrieve the context and afterwards the generation pipeline to generate validation responses.
A `.env` file in the root directory containing the API tokens for OpenAI, Pinecone, and GitHub is required to run the different pipelines:

```
OPENAI_KEY=<your-openai-key>
PINECONE_API_KEY=<your-pinecone-key>
GITHUB_TOKEN=<your-github-key>
```
For running the ingestion pipeline, there are different parameters to be adjusted in `ingestion_config.toml`:

- `embedding_model` defines the embedding model, e.g., qwen or openai.
- `embedding_dimension` defines the dimension of the embedding model, e.g., 3584 for qwen or 1536 for openai.
- `splitting` defines the splitting algorithm, e.g., sentence.
- `urls` defines the URLs that should be scraped and indexed into the vector database.
- `github` defines a list of GitHub repositories from which content should be scraped and indexed into the vector database.
- `data` defines a data directory of pre-processed text files that should be ingested into the vector database.
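The following sketch shows what such an `ingestion_config.toml` could look like. The key names follow the list above, while the flat layout and all concrete values (URL, repository, and directory) are illustrative assumptions, not the repository's actual configuration:

```toml
# Illustrative ingestion_config.toml sketch (layout and values are assumptions)
embedding_model = "qwen"                     # or "openai"
embedding_dimension = 3584                   # 3584 for qwen, 1536 for openai
splitting = "sentence"                       # splitting algorithm
urls = ["https://maven.apache.org/guides/"]  # websites to scrape and index
github = ["apache/maven"]                    # GitHub repositories to ingest
data = "data/preprocessed"                   # directory of pre-processed text files
```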
To run the ingestion pipeline, execute the Jupyter notebook `src/ingestion_pipeline.ipynb`.
As soon as the vector database has been set up and filled with context information, the retrieval pipeline can be executed.
For running the retrieval pipeline, there are different parameters to be adjusted in `retrieval_config.toml`:

- `embedding_model` defines the embedding model, e.g., qwen or openai.
- `embedding_dimension` defines the dimension of the embedding model, e.g., 3584 for qwen or 1536 for openai.
- `index_name` defines the index from which data should be retrieved; the index `all` retrieves context from all existing indices in the vector database.
- `data_file` defines the data file containing the dependencies for which additional context should be retrieved.
- `outfile` defines the output file (JSON) to store the dependencies with the retrieved context for dependency validation.
- `splitting` defines the splitting algorithm, e.g., sentence.
- `num_websites` defines the number of websites to query when retrieving dynamic context for dependency validation, e.g., 3.
- `top_k` defines the number of context chunks to retrieve.
- `alpha` defines the weight for sparse/dense retrieval; set to 0.5 for hybrid search.
- `rerank` defines the re-ranking algorithm, e.g., colbert or sentence.
- `top_n` defines the final number of context chunks that are sent to the LLM, e.g., 3 or 5.
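For orientation, a `retrieval_config.toml` could be populated as in the following sketch. The key names follow the list above, while the layout and all concrete values (file names, `top_k`, etc.) are illustrative assumptions:

```toml
# Illustrative retrieval_config.toml sketch (layout and values are assumptions)
embedding_model = "qwen"
embedding_dimension = 3584
index_name = "all"                               # retrieve from all existing indices
data_file = "data/dependencies.json"             # dependencies to enrich with context
outfile = "data/dependencies_with_context.json"  # output for the generation pipeline
splitting = "sentence"
num_websites = 3                                 # websites queried for dynamic context
top_k = 20                                       # context chunks to retrieve
alpha = 0.5                                      # 0.5 = hybrid sparse/dense search
rerank = "colbert"                               # or "sentence"
top_n = 5                                        # chunks sent to the LLM after re-ranking
```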
To run the retrieval pipeline, execute the Python script `src/retrieval_pipeline.py`.
As soon as you have obtained the retrieved context from the retrieval pipeline, the generation pipeline can be executed.
For running the generation pipeline, there are different parameters to be adjusted in `generation_config.toml`:

- `data_file` defines the data file (JSON) containing the dependencies and the retrieved context for dependency validation.
- `output_file` defines the output file (JSON) to store the validation responses.
- `with_rag` should be `true` to run the validation with RAG, else `false`.
- `with_refinements` should be `true` to run the generation with refinements; by default `false`.
- `model_name` defines the name of the LLM used for dependency validation.
- `temperature` defines the temperature of the LLM; lower temperature values result in more deterministic results. It is set to 0.0.
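A `generation_config.toml` could accordingly look like the following sketch; again, the layout and the concrete values (file names and model name) are illustrative assumptions:

```toml
# Illustrative generation_config.toml sketch (layout and values are assumptions)
data_file = "data/dependencies_with_context.json"  # output of the retrieval pipeline
output_file = "data/validation_responses.json"     # where validation responses are stored
with_rag = true                # true to validate with RAG, else false
with_refinements = false       # default: run generation without refinements
model_name = "gpt-4"           # example LLM name
temperature = 0.0              # deterministic output
```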
To run the generation pipeline, execute the Python script `src/generation_pipeline.py`.