RAG (retrieval-augmented generation) is currently receiving a lot of attention as a way to infuse LLM outputs with relevant context. But can we leverage these approaches for the summarization of source texts? And how does their performance compare to traditional summarization models?
- GÖÇMEN, Hamdi Berke (M.Sc. Informatics Student, TUM)
- QUAN, Guangyao (M.Sc. Informatics Student, TUM)
- WIEHE, Luca Mattes (M.Sc. Robotics, Cognition, Intelligence Student, TUM)
Your task is to use RAG systems and evaluate their performance on summarization benchmarks such as CNN/DailyMail or XSum. The steps comprise:
- Get familiar with RAG. A possible starting point is this tutorial.
- Set up a pipeline to run RAG for summarization (see the sketch after this list). What are good questions to yield a summary?
- Extend your experiments with different retrievers, LLMs, and benchmark datasets.
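As a rough illustration of the pipeline step, the sketch below queries an index built over a single article for a summary. It assumes a recent LlamaIndex release (the retriever modules in this repository follow LlamaIndex concepts), an LLM configured via an API key, and placeholder text; it is not the project's actual pipeline code.

```python
# Minimal sketch, assuming llama-index >= 0.10 style imports; adapt to the
# version pinned in poetry.lock. Requires an LLM backend (e.g. OPENAI_API_KEY set).
from llama_index.core import Document, VectorStoreIndex

article = "..."  # a single source article, e.g. from XSum or CNN/DailyMail

# Index the article and expose it as a query engine.
index = VectorStoreIndex.from_documents([Document(text=article)])
query_engine = index.as_query_engine()

# The query acts as the summarization prompt; its phrasing strongly shapes the summary.
response = query_engine.query("Summarize the key points of the text in two sentences.")
print(response)
```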
This project is organized as follows:
app/
evaluation/
EvaluateSummary
ExternalHelper
SaveMetric
VisualizeMetric
__init__.py
benchmark.py
time_tracking.py
ingestion/
__init__.py
huggingface_datasets_ingestor.py
ingestor.py
llms/
__init__.py
bart.py
retrieval/
__init__.py
auto_merging_retriever.py
extractive_retriever.py
kg_retriever.py
retriever.py
sentence_window_retriever.py
simple_query_engine.py
topic_extractor.py
topic_retriever.py
__init__.py
configs.py
rag.py
data/
indexes/
presentations/
report/
source-texts/
test-folder/
.gitattributes
.gitignore
.pre-commit-config.yaml
benchmark.sh
visualization.sh
poetry.lock
pyproject.toml
rag_pipeline_demo.ipynb
README.md
Each directory and file plays a specific role in the project:
/app/
- Holds the main RAG application code.
/evaluation/
- Includes the evaluation module for assessing model performance.
/ingestion/
- Comprises the data ingestion module, responsible for ingesting datasets (see the sketch after this list).
/llms/
- Contains the implementation of BART as the LLM.
/retrieval/
- Features the retrieval module for fetching contexts.
/configs.py
- Defines the different types for components inside a RAGBuilder.
/rag.py
- Implements the RAGBuilder returning the specific query engine.
/data/
- The main directory for dataset storage, indexes, and source texts.
/indexes/
- Where index files for the datasets are stored.
/outputs/
- Stores all JSON, CSV, and PNG outputs produced by the pipeline (auto-generated during benchmarking).
/presentations/
- A directory for storing midterm presentations and the poster.
/report/
- A folder containing the final report as PDF and all LaTeX source files.
/source-texts/
- The location for raw source texts used in the project.
/test-folder/
- A directory for testing purposes.
/.gitattributes
- Specifies attributes for Git repositories.
/.gitignore
- Specifies intentionally untracked files to ignore.
/.pre-commit-config.yaml
- Configuration for pre-commit hooks.
/benchmark.sh
- Shell script for running the benchmark with a given configuration.
/visualization.sh
- Shell script for visualizing benchmark results.
/poetry.lock
- The lock file for Poetry, pinning specific versions of dependencies.
/pyproject.toml
- The configuration file for Poetry, defining the project and its dependencies.
/rag_pipeline_demo.ipynb
- Jupyter notebook demonstrating the RAG pipeline.
/README.md
- The readme file for the project, providing an overview and documentation.
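To give an idea of what the ingestion module works with, here is a minimal sketch of loading a benchmark dataset via the Hugging Face datasets library; it only illustrates the raw data, not the interface of huggingface_datasets_ingestor.py.

```python
# Minimal sketch using the Hugging Face `datasets` library; depending on the
# installed version, script-based datasets may require trust_remote_code=True.
from datasets import load_dataset

# XSum stores articles in the "document" field and reference summaries in "summary".
xsum = load_dataset("EdinburghNLP/xsum", split="test")
sample = xsum[0]
print(sample["document"][:200])
print("Reference summary:", sample["summary"])
```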
Python package and environment management:
- Install Dependencies
  Use Poetry to install the necessary dependencies:
  poetry install
- Activate the Virtual Environment
  Activate the virtual environment created by Poetry:
  poetry shell
We use poetry for dependency management.
To synchronize the virtual environment with the requirements, you can use:
poetry install --sync
To add a new package to the project, use:
poetry add <package-name>
Install the pre-commit hooks:
pre-commit install
Check that the pre-commit hooks are installed correctly:
pre-commit run --all-files
To run benchmarking for different approaches using the provided script, follow the steps below.
Use the following command to execute the benchmark with the specified parameters:
./benchmark.sh
Please refer to app/configs.py for the available values of these parameters.
- --llm_type: The type of language model to use. Example: GPT35_TURBO.
- --embedding_type: The type of embedding to use. Example: BGE_SMALL_EN.
- --index_type: The type of index to use. Example: VECTOR_INDEX.
- --retrieval_type: The type of retrieval to use. Example: DEFAULT.
- --evaluation_mode: The mode of evaluation. Options: "both" (for most settings), "traditional" (only for the extractive setting), "rag" (in general not useful).
- --eval_path: The path to the evaluation dataset. Example: "EdinburghNLP/xsum".
- --app_id: A specific and distinct application ID; set it to distinguish different approaches. Example: "default".
- --num_samples: The number of samples to use for the benchmark. Example: 100.
- --reset_json: Whether to reset the JSON results file: 1 (yes) or 0 (no).
- --reset_csv: Whether to reset the CSV results file: 1 (yes) or 0 (no).
To run the benchmark with 100 samples and reset both JSON and CSV results files, use:
python app/evaluation/benchmark.py --llm_type GPT35_TURBO --embedding_type BGE_SMALL_EN --index_type VECTOR_INDEX --retrieval_type DEFAULT --evaluation_mode "both" --eval_path "EdinburghNLP/xsum" --app_id "default" --num_samples 100 --reset_json 1 --reset_csv 1
Alternatively, run the provided script in the background:
./benchmark.sh &
This command will start the benchmarking process in the background, using the configuration specified above.
- If you want to use OpenAI models, remember to first save your API key in a .env file at the repository root (see the example below).
- If you want to use local models instead, make sure you have at least 16 GB of GPU memory.
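For reference, a minimal .env can contain a single line. OPENAI_API_KEY is the variable name the OpenAI client libraries read by convention; confirm it matches what the code in this repository expects.

```
# .env at the repository root (assumed variable name)
OPENAI_API_KEY=sk-...
```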