
[use cases] Vertical RAG search + streamlit #811

Open
Undertone0809 opened this issue Jul 22, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@Undertone0809 (Owner)

🚀 Feature Request

Build a web-based AI search engine with streamlit, with the following capabilities:

  • Search results carry references
  • The engine outputs "guess what you want to ask" follow-up suggestions

References

SOP for building a vertical AI search engine 👇

Settle three core questions:

  1. source list: which sources to retrieve data from
  2. answer prompt: which prompt template to use for the answer
  3. llm model: which large language model generates the answer
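The three decisions above could be captured in one small config object. A minimal sketch; every name (the class, the default sources, the model id) is a hypothetical placeholder, not part of any existing codebase:

```python
from dataclasses import dataclass, field


@dataclass
class SearchEngineConfig:
    """The three core decisions: sources, answer prompt, LLM (names hypothetical)."""
    source_list: list[str] = field(
        default_factory=lambda: ["online_api", "local_index"]
    )
    answer_prompt: str = (
        "Answer using only the sources below. Cite each claim as [n].\n\n"
        "Sources:\n{sources}\n\nQuestion: {query}"
    )
    llm_model: str = "gpt-4o-mini"


config = SearchEngineConfig()
prompt = config.answer_prompt.format(sources="[1] ...", query="What is RAG?")
```

Keeping these three choices in one place makes it easy to run the same pipeline against different verticals by swapping configs.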

Query rewrite before search:

  1. Using the message history, decide whether the current query needs retrieval at all
  2. Using the message history, do coreference resolution: replace pronouns with the concrete nouns they refer to
  3. Extract keywords from the resolved query
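The rewrite step is itself an LLM call; a sketch that only builds the request (the prompt wording and function name are hypothetical, and the actual model call is left out):

```python
REWRITE_PROMPT = """Given the chat history and the latest user query:
1. Decide whether answering requires retrieval (yes/no).
2. Rewrite the query with every pronoun resolved to a concrete noun.
3. List search keywords for the rewritten query.

History:
{history}

Query: {query}"""


def build_rewrite_request(history: list[str], query: str) -> str:
    # In production this prompt is sent to the LLM; here we only assemble it.
    return REWRITE_PROMPT.format(history="\n".join(history), query=query)


request = build_rewrite_request(
    ["user: Who created Streamlit?", "assistant: Snowflake maintains it now."],
    "When did they acquire it?",  # "they" / "it" need coreference resolution
)
```

Bundling all three sub-tasks into one call keeps latency down compared with three separate LLM round-trips.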

RAG process

  1. Use query + keywords as input to fetch results from the source list (online API search + local index search); when necessary, translate query + keywords and run several retrieval rounds in different languages
  2. Aggregate and rerank the retrieved results
  3. Fetch the full content of the top_k reranked results
  4. Build the context from the answer prompt + retrieved content + message history, and send it with the latest query to the LLM for the final answer
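The four steps above can be wired together as below. Retrieval, reranking, and the LLM are all stubbed so only the control flow is real; every function is a hypothetical placeholder:

```python
def retrieve(query: str, keywords: list[str], sources: list[str]) -> list[dict]:
    # Step 1: fan out to online APIs and local indexes (stubbed).
    return [{"source": s, "text": f"result for {query} from {s}", "score": 0.5}
            for s in sources]


def rerank(query: str, hits: list[dict]) -> list[dict]:
    # Step 2: aggregate and rerank (stubbed: sort by retrieval score).
    return sorted(hits, key=lambda h: h["score"], reverse=True)


def answer(query: str, history: list[str], hits: list[dict], top_k: int = 3) -> str:
    # Steps 3-4: take top_k details, build the context, call the LLM (stubbed).
    context = "\n".join(h["text"] for h in hits[:top_k])
    return f"{context}\n\nHistory: {history}\n\nQuestion: {query}"


hits = rerank("rag", retrieve("rag", ["retrieval"], ["web", "local"]))
reply_prompt = answer("rag", [], hits)
```

A real implementation would replace each stub (cross-encoder reranking, an actual LLM client) without changing this overall shape.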

Main engineering work

  1. Build an index for each content source

Sources without a standard API need their site data indexed. Incremental indexing goes through the source's own search box; backfill indexing relies on search-engine page snapshots. Obtaining a source's complete data is hard.
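For a source without an API, incremental indexing could feed pages into a toy inverted index as they are discovered. A sketch of the data structure only; the crawling via search box or snapshots is out of scope and the names are hypothetical:

```python
from collections import defaultdict

# token -> set of document ids containing it
index: dict[str, set[str]] = defaultdict(set)


def add_document(doc_id: str, text: str) -> None:
    # Incremental build: index each new page as it is fetched.
    for token in text.lower().split():
        index[token].add(doc_id)


def lookup(term: str) -> set[str]:
    return index.get(term.lower(), set())


add_document("page1", "vertical RAG search engine")
add_document("page2", "streamlit web app")
```

In practice a local index would live in something like SQLite FTS or Lucene rather than an in-memory dict, but the incremental-add pattern is the same.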

  2. Update source weights

System-preset weights plus user clicks update each source's weight; in multi-source retrieval, the number of results returned per source and their initial ordering follow these weights.
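One plausible mechanics for this, sketched below: preset weights nudged toward a ceiling on each click, and per-source result quotas allocated proportionally. The source names, learning rate, and ceiling are all made-up illustration values:

```python
weights = {"wiki": 1.0, "blog": 0.5, "forum": 0.5}  # system presets (hypothetical)


def record_click(source: str, lr: float = 0.1, ceiling: float = 2.0) -> None:
    # A user click nudges the source's weight toward the ceiling.
    weights[source] = weights[source] * (1 - lr) + lr * ceiling


def quota(total: int) -> dict[str, int]:
    # Results per source are allocated proportionally to current weight.
    s = sum(weights.values())
    return {src: round(total * w / s) for src, w in weights.items()}


allocation = quota(8)   # with presets: wiki gets half the slots
record_click("blog")    # blog's weight rises from 0.5 toward 2.0
```

The exponential-moving-average update keeps weights bounded and lets stale preferences decay as click patterns change.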

  3. Multi-source reranking

This needs an efficient, fast reranking framework such as FlashRank.
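The reranking contract is simple regardless of the framework: score each passage against the query and sort. Below is a self-contained sketch using token overlap as a stand-in scorer; FlashRank would replace `overlap_score` with a cross-encoder model, and all names here are hypothetical:

```python
def overlap_score(query: str, passage: str) -> float:
    # Stand-in scorer: fraction of query tokens present in the passage.
    # A real reranker (e.g. FlashRank) would use a cross-encoder instead.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank_passages(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    return sorted(passages, key=lambda p: overlap_score(query, p),
                  reverse=True)[:top_k]


ranked = rerank_passages(
    "rag search engine",
    ["a streamlit tutorial", "building a rag search engine", "cooking tips"],
)
```

Keeping the scorer behind a single function makes it easy to swap the cheap lexical baseline for a model-based one once latency budgets are known.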

  4. Build a chunk content pool

Split retrieved content into chunks and store them in a vector database; when assembling the context for the LLM request, attach only the chunks that are similar to the query instead of passing everything through raw.
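A sketch of the two halves of that pool: overlapping fixed-size chunking, and a similarity lookup that here uses word overlap in place of a real vector-database query. All sizes and names are illustrative assumptions:

```python
def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    # Fixed-size character chunks with overlap (real systems chunk by tokens).
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stand-in for vector-DB similarity search: count shared words.
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]


pieces = chunk("x" * 100)
best = top_chunks("rag engine", ["intro to rag engine design", "weather today"], k=1)
```

The overlap between adjacent chunks keeps sentences that straddle a boundary retrievable from at least one chunk.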

  5. Build a keyword library

Periodically analyze historical queries, extract trending keywords, and build a keyword library. Queries that hit the library serve the retrieve step from cache.
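The cache-on-hit behavior could look like the sketch below: only queries whose keywords intersect the mined hot-keyword set are cached, everything else always retrieves live. The keyword set and function names are hypothetical:

```python
cache: dict[frozenset, list[str]] = {}
hot_keywords = {"rag", "streamlit", "llm"}  # mined from historical queries


def retrieve_with_cache(keywords: list[str], retrieve_fn) -> list[str]:
    key = frozenset(k for k in keywords if k in hot_keywords)
    if key and key in cache:
        return cache[key]           # hot-keyword hit: skip live retrieval
    results = retrieve_fn(keywords)
    if key:
        cache[key] = results        # only hot-keyword queries are cached
    return results


def fake_retrieve(keywords: list[str]) -> list[str]:
    return [f"live result for {' '.join(keywords)}"]


first = retrieve_with_cache(["rag"], fake_retrieve)
second = retrieve_with_cache(["rag"], fake_retrieve)  # served from cache
```

A production version would add TTL-based expiry so cached results for trending topics do not go stale.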



@Undertone0809 added the enhancement (New feature or request) label on Jul 22, 2024

