RAG (Retrieval-Augmented Generation)


  • RAG (Retrieval-Augmented Generation): Integrates retrieval (search) into LLM text generation, helping the model “look up” external information to improve its responses. cite [25 Aug 2023]

  • In a 2020 paper, Meta (Facebook) introduced a framework called retrieval-augmented generation to give LLMs access to information beyond their training data. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: [cnt] [22 May 2020]

    1. RAG-sequence — We retrieve k documents, and use them to generate all the output tokens that answer a user query.
    2. RAG-token — We retrieve k documents, use them to generate the next token, then retrieve k more documents, use them to generate the next token, and so on. This means that we could end up retrieving several different sets of documents in the generation of a single answer to a user’s query.
    3. Of the two approaches proposed in the paper, the RAG-sequence implementation is pretty much always used in the industry. It’s cheaper and simpler to run than the alternative, and it produces great results. cite [30 Sep 2023]
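    A minimal RAG-sequence sketch of the dominant pattern above: retrieve k documents once, then generate the entire answer conditioned on them. The retriever and llm objects here are hypothetical stand-ins for any vector store and LLM client.

    def rag_sequence_answer(query: str, retriever, llm, k: int = 5) -> str:
        docs = retriever.search(query, top_k=k)        # retrieve once per query
        context = "\n\n".join(d.text for d in docs)    # pack all k documents
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return llm.generate(prompt)                    # generate all output tokens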

Research Papers

  • A Survey on Retrieval-Augmented Text Generation: [cnt]: This paper conducts a survey on retrieval-augmented text generation, highlighting its advantages and state-of-the-art performance in many NLP tasks. These tasks include Dialogue response generation, Machine translation, Summarization, Paraphrase generation, Text style transfer, and Data-to-text generation. [2 Feb 2022]
  • HyDE: Hypothetical Document Embeddings. Zero-shot: generate a hypothetical document -> embed it -> average the vectors -> retrieve. [20 Dec 2022]
  • Active Retrieval Augmented Generation: [cnt]: Forward-Looking Active REtrieval augmented generation (FLARE): FLARE iteratively generates a temporary next sentence and checks whether it contains low-probability tokens. If so, the system retrieves relevant documents and regenerates the sentence. Low-probability tokens are identified via token_logprobs in the OpenAI API response. git [11 May 2023]
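    A minimal sketch of FLARE's confidence check (assumes the OpenAI Python SDK v1; the 0.6 probability threshold and model name are placeholders, and the paper's full loop then retrieves and regenerates the flagged sentence):

    import math
    from openai import OpenAI

    client = OpenAI()

    def needs_retrieval(prompt: str, threshold: float = 0.6) -> bool:
        """Generate a tentative next sentence; flag it if any token is low-probability."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            logprobs=True,
            max_tokens=64,
        )
        return any(math.exp(t.logprob) < threshold
                   for t in resp.choices[0].logprobs.content)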
  • Benchmarking Large Language Models in Retrieval-Augmented Generation: [cnt]: Retrieval-Augmented Generation Benchmark (RGB) is proposed to assess LLMs on 4 key abilities [4 Sep 2023]:
    1. Noise robustness (external documents contain noise; models struggled when noise exceeded 80%)
    2. Negative rejection (external documents are all noise; the highest rejection rate was only 45%)
    3. Information integration (difficulty summarizing across multiple documents; the highest accuracy was 60-67%)
    4. Counterfactual robustness (failed to detect factual errors in counterfactual external documents)

  • Retrieval meets Long Context LLMs: [cnt]: We demonstrate that retrieval-augmentation significantly improves the performance of 4K context LLMs. Perhaps surprisingly, we find this simple retrieval-augmented baseline can perform comparable to 16K long context LLMs. [4 Oct 2023]
  • FreshLLMs: [cnt]: FreshPrompt: run a Google search first, then include the results in the prompt. Our experiments show that FreshPrompt outperforms both competing search-engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. git [5 Oct 2023]
  • Self-RAG: [cnt] git [17 Oct 2023]
    1. Critic model C: Generates reflection tokens (IsREL: relevant/irrelevant; IsSUP: fully supported/partially supported/no support; IsUse: usefulness score 5,4,3,2,1). It is pretrained on data labeled by GPT-4.
    2. Generator model M: The main language model that generates task outputs and reflection tokens. It leverages the data labeled by the critic model during training.
    3. Retriever model R: Retrieves relevant passages. The LM decides whether external passages (retriever) are needed for text generation.
  • RECOMP: Improving Retrieval-Augmented LMs with Compressors: [cnt]: 1. We propose RECOMP (Retrieve, Compress, Prepend), an intermediate step which compresses retrieved documents into a textual summary prior to prepending them to improve retrieval-augmented language models (RALMs). 2. We present two compressors – an extractive compressor which selects useful sentences from retrieved documents and an abstractive compressor which generates summaries by synthesizing information from multiple documents. 3. Both compressors are trained. [6 Oct 2023]
  • Retrieval-Augmentation for Long-form Question Answering: [cnt]: 1. The order of evidence documents affects the order of generated answers. 2. The last sentence of the answer is more likely to be unsupported by evidence. 3. Automatic methods for detecting attribution can achieve reasonable performance, but still lag behind human agreement. Attribution in the paper assesses how well answers are grounded in the provided evidence and avoid fabricating information. [18 Oct 2023]
  • RAG for LLMs: [cnt] 🏆Retrieval-Augmented Generation for Large Language Models: A Survey: Three paradigms of RAG: Naive RAG > Advanced RAG > Modular RAG [18 Dec 2023]
  • INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning: INTERS covers 21 search tasks across three categories: query understanding, document understanding, and query-document relationship understanding. The dataset is designed for instruction tuning, a method that fine-tunes LLMs on natural language instructions. git [12 Jan 2024]
  • RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. [16 Jan 2024]
  • The Power of Noise: Redefining Retrieval for RAG Systems: Providing the LLM context with no more than 2-5 relevant documents plus a small amount of random noise maximizes RAG accuracy. [26 Jan 2024]
  • Corrective Retrieval Augmented Generation (CRAG): Retrieval Evaluator assesses the retrieved documents and categorizes them as Correct, Ambiguous, or Incorrect. For Ambiguous and Incorrect documents, the method uses Web Search to improve the quality of the information. The refined and distilled documents are then used to generate the final output. [29 Jan 2024] CRAG implementation by LangGraph git
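    A sketch of the CRAG control flow described above (retrieve, evaluate, web_search, refine, and generate are hypothetical helpers):

    def crag_answer(query: str) -> str:
        docs = retrieve(query)
        verdict = evaluate(query, docs)  # lightweight evaluator: "Correct" / "Ambiguous" / "Incorrect"
        if verdict == "Correct":
            evidence = refine(docs)                      # distill the retrieved documents
        elif verdict == "Incorrect":
            evidence = refine(web_search(query))         # discard docs, fall back to web search
        else:                                            # Ambiguous: combine both sources
            evidence = refine(docs + web_search(query))
        return generate(query, evidence)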
  • Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. git [21 Mar 2024]
  • RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval: Introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. git / pip install llama-index-packs-raptor / git [31 Jan 2024]
  • CRAG: Comprehensive RAG Benchmark: a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search ref [7 Jun 2024]
  • PlanRAG: Decision Making. Introduces the Decision QA benchmark (DQA). Plan -> Retrieve -> Make a decision (PlanRAG). git [18 Jun 2024]
  • Searching for Best Practices in Retrieval-Augmented Generation: Best Performance Practice: Query Classification, Hybrid with HyDE (retrieval), monoT5 (reranking), Reverse (repacking), Recomp (summarization). Balanced Efficiency Practice: Query Classification, Hybrid (retrieval), TILDEv2 (reranking), Reverse (repacking), Recomp (summarization). [1 Jul 2024]
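    A sketch of the "Reverse" repacking step named above: order the retrieved chunks so the most relevant text sits last, closest to the question in the prompt (input is assumed to be (text, score) pairs):

    def repack_reverse(chunks_with_scores: list[tuple[str, float]]) -> str:
        ranked = sorted(chunks_with_scores, key=lambda c: c[1])   # ascending relevance
        return "\n\n".join(text for text, _score in ranked)       # best chunk last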
  • Retrieval Augmented Generation or Long-Context LLMs?: Long-Context consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. [23 Jul 2024]
  • Graph Retrieval-Augmented Generation: A Survey [15 Aug 2024]
  • OP-RAG: Order-preserve RAG: Unlike traditional RAG, which sorts retrieved chunks by relevance, we keep them in their original order from the text. [3 Sep 2024]
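    A sketch of order-preserving repacking (assumes each chunk dict carries its retrieval score and its original position in the source text):

    def repack_order_preserving(chunks: list[dict], k: int = 8) -> list[dict]:
        top = sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]  # keep top-k by relevance
        return sorted(top, key=lambda c: c["position"])                   # restore original text order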
  • Retrieval Augmented Generation (RAG) and Beyond:🏆The paper classifies user queries into four levels—explicit, implicit, interpretable rationale, and hidden rationale—and highlights the need for external data integration and fine-tuning LLMs for specialized tasks. [23 Sep 2024]
  • Astute RAG: adaptively extracts essential information from LLMs, consolidates internal and external knowledge with source awareness, and finalizes answers based on reliability. [9 Oct 2024]

Advanced RAG

  • RAG Pipeline
    1. Indexing Stage: Preparing a knowledge base.
    2. Querying Stage: Querying the indexed data to retrieve relevant information.
    3. Responding Stage: Generating responses based on the retrieved information. ref
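    A library-free sketch of the three stages (embed and llm_generate are hypothetical stand-ins for an embedding model and an LLM client):

    import numpy as np

    # 1. Indexing: embed the knowledge base once.
    docs = ["Faiss performs similarity search.", "Postgres stores relational data."]
    doc_vecs = np.array([embed(d) for d in docs])

    # 2. Querying: embed the query and retrieve the best match by cosine similarity.
    query = "Which library performs similarity search?"
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]

    # 3. Responding: generate from the retrieved context.
    answer = llm_generate(f"Context: {context}\nQuestion: {query}")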
  • Evaluation with Ragas: UMAP (often used to reduce the dimensionality of embeddings) with Ragas metrics for visualizing RAG results. [Mar 2024] / Ragas provides metrics: Context Precision, Context Relevancy, Context Recall, Faithfulness, Answer Relevance, Answer Semantic Similarity, Answer Correctness, Aspect Critique. git [May 2023]
  • Advanced RAG Patterns: How to improve RAG performance. ref / ref [17 Oct 2023]
    1. Data quality: Clean, standardize, deduplicate, segment, annotate, augment, and update data to make it clear, consistent, and context-rich.
    2. Embeddings fine-tuning: Fine-tune embeddings to domain specifics, adjust them according to context, and refresh them periodically to capture evolving semantics.
    3. Retrieval optimization: Refine chunking, embed metadata, use query routing, multi-vector retrieval, re-ranking, hybrid search, recursive retrieval, query engine, HyDE [20 Dec 2022], and vector search algorithms to improve retrieval efficiency and relevance.
    4. Synthesis techniques: Query transformations, prompt templating, prompt conditioning, function calling, and fine-tuning the generator to refine the generation step.
    • HyDE: Implemented in LangChain: HypotheticalDocumentEmbedder. A query generates hypothetical documents, which are then embedded and used to retrieve the most relevant results: query -> generate n hypothetical documents -> embed the documents -> average the embeddings -> retrieve -> final result. ref
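    A HyDE sketch outside LangChain (assumes the OpenAI Python SDK v1; model names are placeholders):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def hyde_query_vector(query: str, n: int = 4) -> np.ndarray:
        """Embed n hypothetical answers and average them; retrieve with the result."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Write a short passage answering: {query}"}],
            n=n,
            temperature=0.7,
        )
        fake_docs = [choice.message.content for choice in resp.choices]
        emb = client.embeddings.create(model="text-embedding-3-small", input=fake_docs)
        return np.mean([d.embedding for d in emb.data], axis=0)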
  • How to optimize RAG pipeline: Indexing optimization [24 Oct 2023]
  • Demystifying Advanced RAG Pipelines: An LLM-powered advanced RAG pipeline built from scratch. git [19 Oct 2023]
  • OpenAI has put together a pretty good roadmap for building a production RAG system: Naive RAG -> Tune Chunks -> Rerank & Classify -> Prompt Engineering. In llama_index... 📺 cite [7 Nov 2023]
  • 9 Effective Techniques To Boost Retrieval Augmented Generation (RAG) Systems doc: ReRank, Prompt Compression, Hypothetical Document Embedding (HyDE), Query Rewrite and Expansion, Enhance Data Quality, Optimize Index Structure, Add Metadata, Align Query with Documents, Mixed Retrieval (Hybrid Search) [2 Jan 2024]
  • Contextual Retrieval: Contextual Retrieval enhances traditional RAG by using Contextual Embeddings and Contextual BM25 to maintain context during retrieval. [19 Sep 2024]
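    A sketch of the contextualization step (llm_generate, embed_index, and bm25_index are hypothetical helpers; the prompt follows the article's idea of situating each chunk within its source document):

    SITUATE = ("<document>{doc}</document>\n"
               "Write a short context situating this chunk within the document above:\n{chunk}")

    def contextualize(doc: str, chunks: list[str]) -> list[str]:
        # Prepend an LLM-written context to each chunk before indexing.
        return [llm_generate(SITUATE.format(doc=doc, chunk=c)) + "\n" + c for c in chunks]

    contextual_chunks = contextualize(document_text, chunks)
    embed_index.add(contextual_chunks)  # Contextual Embeddings
    bm25_index.add(contextual_chunks)   # Contextual BM25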

Agentic RAG

  • From Simple to Advanced RAG (LlamaIndex) ref / doc /💡ref [10 Oct 2023]
  • What is Agentic RAG: The article published by Weaviate. [5 Nov 2024]

Multi-modal RAG (Vision RAG)

GraphRAG

  • Graph RAG (by NebulaGraph): NebulaGraph proposes the concept of Graph RAG, which is a retrieval enhancement technique based on knowledge graphs. demo [8 Sep 2023]
  • GraphRAG (by Microsoft): 1. Global search: Original Documents -> Knowledge Graph (Community Summaries generated by LLM) -> Partial Responses -> Final Response (a minimal sketch of this flow follows this list). 2. Local Search: Utilizes vector-based search to find the nearest entities and relevant information. ref / git [24 Apr 2024]
    • GraphRAG Implementation with LlamaIndex [15 Jul 2024]
    • "From Local to Global" GraphRAG with Neo4j and LangChain [09 Jul 2024]
    • LightRAG: Utilizing graph structures for text indexing and retrieval processes. [8 Oct 2024]
    • nano-graphrag: A simple, easy-to-hack GraphRAG implementation. [Jul 2024]
    • DRIFT Search: DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal) combines global and local search methods to improve query relevance by generating sub-questions and refining the context using HyDE (Hypothetical Document Embeddings). [31 Oct 2024]
    • Improving global search via dynamic community selection: Dynamic Community Selection narrows the scope by selecting the most relevant communities based on query relevance, utilizing Map-reduce search, reducing costs by 77% without sacrificing output quality [15 Nov 2024]
    • LazyGraphRAG: Reduces costs to 0.1% of full GraphRAG through efficient use of best-first (vector-based) and breadth-first (global search) retrieval and deferred LLM calls [25 Nov 2024]
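    A minimal sketch of the global-search flow referenced above (community_summaries, produced offline by the LLM, and llm_generate are hypothetical):

    def global_search(query: str, community_summaries: list[str]) -> str:
        # Map: each community summary yields a partial response to the query.
        partials = [llm_generate(f"Summary:\n{s}\n\nAnswer the question if relevant: {query}")
                    for s in community_summaries]
        # Reduce: combine the partial responses into the final response.
        joined = "\n\n".join(partials)
        return llm_generate(f"Combine these partial answers:\n{joined}\n\nQuestion: {query}")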

The Problem with RAG

  • The Problem with RAG
    1. A question is not semantically similar to its answers. Cosine similarity may favor semantically similar texts that do not contain the answer.
    2. Semantic similarity gets diluted if the document is too long. Cosine similarity may favor short documents with only the relevant information.
    3. The information needs to be contained in one or a few documents; RAG struggles with questions whose answers require aggregating information by scanning the whole dataset.
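    A small illustration of point 1 (embed is a hypothetical sentence-embedding function):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query      = "Why is the sky blue?"
    paraphrase = "What makes the sky appear blue?"
    answer     = "Rayleigh scattering disperses short-wavelength light in the atmosphere."

    # The paraphrased question often scores higher than the passage that answers it.
    print(cosine(embed(query), embed(paraphrase)))
    print(cosine(embed(query), embed(answer)))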
  • Seven Failure Points When Engineering a Retrieval Augmented Generation System: 1. Missing Content, 2. Missed the Top Ranked Documents, 3. Not in Context, 4. Not Extracted, 5. Wrong Format, 6. Incorrect Specificity, 7. Lack of Thorough Testing [11 Jan 2024]
  • Solving the core challenges of Retrieval-Augmented Generation ref [Feb 2024]

RAG Solution Design & Application

RAG Solution Design


RAG Development

  1. Haystack: LLM orchestration framework to build customizable, production-ready LLM applications. [5 May 2020]
  2. Cognita: RAG (Retrieval Augmented Generation) framework for building modular, open-source applications. [Jul 2023]
  3. Canopy: Open-source RAG framework and context engine built on top of the Pinecone vector database. [Aug 2023]
  4. RAGflow: Streamlined RAG workflow, focusing on deep document understanding. [Dec 2023]
  5. AutoRAG: RAG AutoML tool that automatically finds an optimal RAG pipeline for your data. [Jan 2024]
  6. RAGApp: Agentic RAG. Like custom GPTs, but deployable in your own cloud infrastructure using Docker. [Apr 2024]
  7. RAG Builder: Automatically create an optimal production-ready Retrieval-Augmented Generation (RAG) setup for your data. [Jun 2024]
  8. MindSearch: An open-source AI search engine framework. [Jul 2024]
  9. RAGFoundry: A library designed to improve LLMs' ability to use external information by fine-tuning models on specially created RAG-augmented datasets. [5 Aug 2024]
  10. RAGChecker: A fine-grained framework for diagnosing RAG. git [15 Aug 2024]

RAG Application

  1. SWIRL AI Connect: SWIRL AI Connect enables you to perform Unified Search and bring in a secure AI Co-Pilot. [Apr 2022]
  2. PaperQA2: High-accuracy RAG for answering questions from scientific documents with citations. [Feb 2023]
  3. Danswer: Ask questions in natural language and get answers backed by private sources: Slack, GitHub, Confluence, etc. [Apr 2023]
  4. PrivateGPT: 100% private, no data leaks. The API is built using FastAPI and follows OpenAI's API scheme. [May 2023]
  5. quivr: A personal productivity assistant (RAG). Chat with your docs (PDF, CSV, ...). [May 2023]
  6. Verba: Retrieval-Augmented Generation (RAG) chatbot powered by Weaviate. [Jul 2023]
  7. RAG capabilities of LlamaIndex to QA about SEC 10-K & 10-Q documents: A real-world full-stack application using LlamaIndex. [Sep 2023]
  8. RAGxplorer: Visualizing document chunks and the queries in the embedding space. [Jan 2024]
  9. Open Source AI Searches: Perplexica: 💡Open-source alternative to Perplexity AI [Apr 2024] / Marqo [Aug 2022] / txtai [Aug 2020] / Typesense [Jan 2017] / Morphic [Apr 2024]
  10. llm-answer-engine: Build a Perplexity-inspired answer engine using Next.js, Groq, Mixtral, LangChain, OpenAI, Brave & Serper. [Mar 2024]
  11. turboseek: An AI search engine inspired by Perplexity. [May 2024]
  12. R2R: R2R (RAG to Riches), the Elasticsearch for RAG. [Feb 2024]
  13. FlashRAG: A Python toolkit for efficient RAG research. [Mar 2024]
  14. kotaemon: Open-source clean & customizable RAG UI for chatting with your documents. [Mar 2024]
  15. MedGraphRAG: MedGraphRAG outperforms the previous SOTA model, Medprompt, by 1.1%. git [8 Aug 2024]
  16. HybridRAG: Integrating VectorRAG and GraphRAG with financial earnings call transcripts in Q&A format. [9 Aug 2024]
  17. MemFree: Hybrid AI search engine + AI page generator. [Jun 2024]
  18. RAGLite: A Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite. [Jun 2024]
  19. Applications, Frameworks, and User Interface (UI/UX): x-ref

LlamaIndex

  • LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. The high-level API allows users to ingest and query their data in a few lines of code. High-Level Concept: ref / doc:ref / blog:ref / git [Nov 2022]

    Fun fact: this core idea was the initial inspiration for GPT Index (the former name of LlamaIndex) on 11/8/2022, almost a year ago! cite / Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading

    1. Build a data structure (memory tree)
    2. Traverse it via LLM prompting
  • LlamaIndex Toolkits:

    • LlamaHub: A library of data loaders for LLMs. git [Feb 2023]
    • LlamaIndex CLI: a command line tool to generate LlamaIndex apps. ref [Nov 2023]
    • LlamaParse: A unique parsing tool for intricate documents. git [Feb 2024]

LlamaIndex integration with Azure AI

High-Level Concepts

  • Query engine vs Chat engine

    1. The query engine wraps a retriever and a response synthesizer into a pipeline that uses the query string to fetch nodes (sentences or paragraphs) from the index and then sends them to the LLM to generate a response.
    2. The chat engine is a quick and simple way to chat with the data in your index. It uses a context manager to keep track of the conversation history and generate relevant queries for the retriever. Conceptually, it is a stateful analogy of a Query Engine.
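    A minimal sketch contrasting the two engines (assumes llama-index >= 0.10, documents under ./data, and an LLM configured via Settings; chat_mode is one of the library's built-in modes):

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Query engine: stateless retrieve-then-synthesize over the index.
    query_engine = index.as_query_engine()
    print(query_engine.query("What does the report conclude?"))

    # Chat engine: stateful; condenses conversation history into retriever queries.
    chat_engine = index.as_chat_engine(chat_mode="condense_question")
    print(chat_engine.chat("And what are its limitations?"))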
  • Storage Context vs Settings (previously known as Service Context)

    • Both the Storage Context and Service Context are data classes.

      1. As of v0.10.0, ServiceContext has been replaced by the Settings object.
      2. Storage Context is responsible for the storage and retrieval of data in Llama Index, while the Service Context helps in incorporating external context to enhance the search experience.
      3. The Service Context is not directly involved in the storage or retrieval of data, but it helps in providing a more context-aware and accurate search experience.
    # Comparing the two containers (fields abridged from the llama_index source;
    # the referenced field types are llama_index internals, imports omitted).
    from dataclasses import dataclass
    from typing import Any, Callable, List, Optional

    # The storage context container is a utility container for storing nodes, indices, and vectors.
    @dataclass
    class StorageContext:
        docstore: BaseDocumentStore
        index_store: BaseIndexStore
        vector_store: VectorStore
        graph_store: GraphStore

    # NOTE: Deprecated, use llama_index.settings.Settings. The service context container
    # is a utility container for LlamaIndex index and query classes.
    @dataclass
    class ServiceContext:
        llm_predictor: BaseLLMPredictor
        prompt_helper: PromptHelper
        embed_model: BaseEmbedding
        node_parser: NodeParser
        llama_logger: LlamaLogger
        callback_manager: CallbackManager

    # The global Settings object that replaces ServiceContext.
    @dataclass
    class _Settings:
        # lazy initialization
        _llm: Optional[LLM] = None
        _embed_model: Optional[BaseEmbedding] = None
        _callback_manager: Optional[CallbackManager] = None
        _tokenizer: Optional[Callable[[str], List[Any]]] = None
        _node_parser: Optional[NodeParser] = None
        _prompt_helper: Optional[PromptHelper] = None
        _transformations: Optional[List[TransformComponent]] = None
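    A usage sketch of the replacement Settings singleton (assumes llama-index >= 0.10 with the llama-index-llms-openai and llama-index-embeddings-openai integration packages installed; model names are placeholders):

    from llama_index.core import Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # Set once; components read these lazily wherever a ServiceContext was previously passed.
    Settings.llm = OpenAI(model="gpt-4o-mini")
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")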

LlamaIndex Tutorial

  • LlamaIndex Overview (Japanese) [17 Jul 2023]

  • Fine-Tuning a Linear Adapter for Any Embedding Model: Fine-tuning the embedding model itself requires re-indexing your documents. With this approach, you do not need to re-embed your documents; you simply transform the query instead. [7 Sep 2023]
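    A minimal sketch of the idea (PyTorch is assumed here; training_pairs of precomputed (query, document) embedding tensors is hypothetical):

    import torch

    dim = 1536                                # e.g. text-embedding-ada-002
    W = torch.eye(dim, requires_grad=True)    # linear adapter, initialized as identity
    opt = torch.optim.Adam([W], lr=1e-4)

    for q_vec, d_vec in training_pairs:       # hypothetical precomputed embedding pairs
        loss = 1 - torch.nn.functional.cosine_similarity(W @ q_vec, d_vec, dim=0)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # At query time, search the unchanged document index with (W @ embed(query)).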

  • 4 RAG techniques implemented in llama_index / cite [20 Sep 2023] / git

    1. SQL Router Query Engine: Query router that can reference your vector database or SQL database

    2. Sub Question Query Engine: Break down the complex question into sub-questions

    3. Recursive Retriever + Query Engine: Reference node relationships, rather than only finding a node (chunk) that is most relevant.

    4. Self Correcting Query Engines: Use an LLM to evaluate its own output.

  • LlamaIndex Tutorial: A Complete LlamaIndex Guide [18 Oct 2023]

Vector Database Comparison

  • Faiss: Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors, developed by Facebook AI Research. It is often used as a building block for, or a lightweight alternative to, a dedicated vector database. git [Feb 2017]
  • Milvus (A cloud-native vector database) Embedded git [Sep 2019]: An open-source alternative to Pinecone and Redis Search. It offers support for multiple languages, addresses the limitations of RediSearch, and provides cloud scalability and high reliability with Kubernetes.
  • Qdrant: Written in Rust. Qdrant (read: quadrant). [May 2020]
  • Pinecone: A fully managed cloud vector database. Commercial product. [Jan 2021]
  • Weaviate: Stores both vectors and data objects. [Jan 2021]
  • pgvector: Open-source vector similarity search for Postgres [Apr 2021] / pgvectorscale: 75% cheaper than Pinecone [Jul 2023]
  • Not All Vector Databases Are Made Equal: A printed (doc) version is provided to work around Medium's reading limits. [2 Oct 2021]
  • Chroma: Open-source embedding database. [Oct 2022]
  • Redis extension for vector search, RedisVL: Redis Vector Library (RedisVL). [Nov 2022]
  • A SQLite extension for efficient vector search, based on Faiss! [Jan 2023]
  • lancedb: LanceDB's core is written in Rust and is built using Lance, an open-source columnar format. [Feb 2023]
  • A Comprehensive Survey on Vector Database: Categorizes search algorithms by their approach, such as hash-based, tree-based, graph-based, and quantization-based. [18 Oct 2023]

Vector Database Options for Azure

Note: Azure Cache for Redis Enterprise: the Enterprise SKU series cannot be deployed with templates such as Bicep and ARM.


Embedding

  • Azure OpenAI Embedding API, text-embedding-ada-002, supports 1,536 dimensions. Elasticsearch, a Lucene-based engine, supports a maximum of 1,024 dimensions. OpenSearch can store vectors with up to 16,000 dimensions, so it can be used as a vector database alongside the Azure OpenAI Embedding API.
  • OpenAI Embedding models: text-embedding-3 x-ref > New embedding models
  • text-embedding-ada-002: Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making them more cost-effective for working with vector databases. [15 Dec 2022]
  • However, one exception to this is that the maximum dimension count for the Lucene engine is 1,024, compared with 16,000 for the other engines. ref
  • Vector Search with OpenAI Embeddings: Lucene Is All You Need: Our experiments were based on Lucene 9.5.0, but indexing was a bit tricky because the HNSW implementation in Lucene restricts vectors to 1024 dimensions, which was not sufficient for OpenAI’s 1536-dimensional embeddings. Although the resolution of this issue, which is to make vector dimensions configurable on a per codec basis, has been merged to the Lucene source trunk git, this feature has not been folded into a Lucene release (yet) as of early August 2023. [29 Aug 2023]
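    For context, OpenAI's newer text-embedding-3 models accept a dimensions parameter, which can shorten embeddings to fit such engine limits (a minimal sketch, assuming the OpenAI Python SDK v1):

    from openai import OpenAI

    client = OpenAI()
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input="sample text",
        dimensions=1024,   # shortened to fit e.g. Lucene's earlier 1,024-dimension HNSW cap
    )
    vec = emb.data[0].embedding  # len(vec) == 1024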
  • Is Cosine-Similarity of Embeddings Really About Similarity?: In linear matrix factorization, regularization can affect cosine similarities and in some cases render them meaningless. Regularization involves two objectives. The first applies L2-norm regularization to the product of matrices A and B, a process similar to dropout. The second applies L2-norm regularization to each individual matrix, similar to the weight decay technique used in deep learning. [8 Mar 2024]
  • Contextual Document Embedding (CDE): Improve document retrieval by embedding both queries and documents within the context of the broader document corpus. ref [3 Oct 2024]
  • Fine-tuning Embeddings for Specific Domains [1 Oct 2024]