How to speed up insert process? #212

Open
fahadh4ilyas opened this issue Nov 5, 2024 · 16 comments
Labels
good first issue Good for newcomers

Comments

@fahadh4ilyas

The insert process is quite slow, even for a small document. I tried changing the llm_model_max_async value, but the speed never changes. I also saw that the insert process uses only a single core of my CPU. Is there any way to speed up the process, maybe by using multiple threads or processes?

@JavieHush
Contributor

Try using a GPU instead; the speed will increase. Based on my own testing, the insert process of LightRAG is much faster than that of GraphRAG.

@abylikhsanov
Contributor

@JavieHush Can you elaborate on that more?

@Jaykumaran

@JavieHush

Facing the same issue. Could you describe how to achieve this?

@JavieHush
Contributor

Guys :) I'm not quite sure about the situation you've encountered. My own setup and results are as follows.

Suggestions

The insert process depends heavily on the LLM and the embedding model: the LLM extracts entities and relations, and the embedding model builds the index. This requires a significant amount of computing resources. If you run the models locally, GPU acceleration is recommended; with CPU only, it will be much slower.
A model with fewer parameters may also process faster, but be aware that it may produce lower-quality results, so you have to strike a balance.
I also noticed that using an external graph DB and vector DB may accelerate the insert process (and the query process as well). We're currently working on how to integrate all of these.
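
Before changing hardware or models, it can help to measure which stage actually dominates. Below is a minimal sketch of a generic timing wrapper for async callables. The names `timed`, `fake_llm`, and `fake_embed` are hypothetical stand-ins, and wrapping the real functions passed to LightRAG assumes they are plain async callables, which this thread does not confirm.

```python
import asyncio
import time
from collections import defaultdict

# Accumulated wall-clock seconds and call counts, keyed by label.
totals = defaultdict(float)
calls = defaultdict(int)

def timed(label, func):
    """Wrap an async callable so each call's wall time is added to totals[label]."""
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return await func(*args, **kwargs)
        finally:
            totals[label] += time.perf_counter() - start
            calls[label] += 1
    return wrapper

# Hypothetical stand-ins for a slow LLM call and a faster embedding call.
async def fake_llm(prompt):
    await asyncio.sleep(0.05)
    return "entities for: " + prompt

async def fake_embed(texts):
    await asyncio.sleep(0.01)
    return [[0.0] * 4 for _ in texts]

async def demo():
    llm = timed("llm", fake_llm)
    embed = timed("embedding", fake_embed)
    for chunk in ["chunk one", "chunk two", "chunk three"]:
        await llm(chunk)
        await embed([chunk])

asyncio.run(demo())
for label in totals:
    print(f"{label}: {calls[label]} calls, {totals[label]:.2f}s total")
```

When calls overlap, the per-label totals add up overlapping time, so they show where time is spent rather than the end-to-end duration.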

About my situation

We use a local Ollama service to power the framework, on a workstation with 8 × Tesla P100 GPUs.

Evaluation

I tested LightRAG and GraphRAG with a made-up fairy tale (2k tokens, generated by GPT-4o, so no LLM has seen the story before). The insert process took 2~3 min with LightRAG and more than 15 min with GraphRAG.

@abylikhsanov
Contributor

@JavieHush That is why I got confused: in my case I am not running the LLM locally but using APIs, so I wondered what you meant by using a GPU.

@JavieHush
Contributor

By the way, how long did the insert process take for you? It should be much faster with an API than with a local model service 🤔

@abylikhsanov
Contributor

@JavieHush I used a different document, which ended up with 3k entities. It consumed 6.1 million GPT-4o mini tokens and around 1 million embedding tokens (which is very cheap), so around $1.
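
The "around $1" figure can be sanity-checked with simple arithmetic. The sketch below is only an estimate: the per-million-token prices are assumptions for illustration, and because the split between input and output tokens is not given in the thread, every LLM token is billed at the (cheaper) input rate, which makes the result a lower bound.

```python
# Per-million-token prices in USD. These are assumptions for illustration;
# check the provider's current pricing before relying on them.
PRICE_PER_MILLION = {
    "llm_input": 0.15,   # assumed GPT-4o mini input price
    "embedding": 0.02,   # assumed small embedding model price
}

def estimate_cost(llm_tokens, embedding_tokens):
    """Lower-bound estimate that bills every LLM token at the input rate."""
    llm_cost = llm_tokens / 1_000_000 * PRICE_PER_MILLION["llm_input"]
    embedding_cost = embedding_tokens / 1_000_000 * PRICE_PER_MILLION["embedding"]
    return llm_cost + embedding_cost

total = estimate_cost(llm_tokens=6_100_000, embedding_tokens=1_000_000)
print(f"estimated cost: ${total:.2f}")
```

Under these assumed prices the total comes out a little under $1, which is consistent with the figure reported above.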

@Jaykumaran

Jaykumaran commented Nov 6, 2024

@JavieHush I'm running locally with Ollama. Can you explain how to make use of the GPU while indexing?

import os
import logging
from lightrag import LightRAG, QueryParam
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc
import pdfplumber

# Environment="OLLAMA_KEEP_ALIVE=-1"

WORKING_DIR = "./mydir"

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)

if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

rag = LightRAG(
    working_dir=WORKING_DIR,
    chunk_token_size=1200,  # 1200 based on resources
    llm_model_func=ollama_model_complete,
    llm_model_name="qwen2.5",
    llm_model_max_async=4,  # reduce to 4 or 8 depending on cpu and mem resources
    llm_model_max_token_size=32768,
    llm_model_kwargs={"host": "http://localhost:11434", "options": {"num_ctx": 32768}},
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts, embed_model="nomic-embed-text", host="http://localhost:11434"
        ),
    ),
)

pdf_path = "../CompaniesAct2013.pdf"

pdf_text = ""

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for pages without extractable text
        pdf_text += (page.extract_text() or "") + "\n"

rag.insert(pdf_text)

print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))

print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))

print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))

print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))

@JavieHush
Contributor

First of all, make sure your GPU supports accelerated model inference. Are you using an Nvidia card or something else?

GPU acceleration should be configured in the Ollama settings.

Please refer to "Run ollama with docker-compose and using gpu".
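
To check whether a loaded model actually sits on the GPU, one option is to ask the Ollama server itself. The sketch below assumes the server exposes the `/api/ps` endpoint and that each entry carries `size` and `size_vram` fields, as recent Ollama versions do; the helper names and the sample values are made up for illustration.

```python
import json
import urllib.request

def fetch_running_models(host="http://localhost:11434"):
    """Return the list of models currently loaded by the Ollama server."""
    with urllib.request.urlopen(f"{host}/api/ps", timeout=5) as response:
        return json.load(response).get("models", [])

def gpu_fraction(model_entry):
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size

# Hand-written entry shaped like one /api/ps item (values are made up).
sample = {"name": "qwen2.5:latest", "size": 6_000_000_000, "size_vram": 6_000_000_000}
print(f"{sample['name']}: {gpu_fraction(sample):.0%} in VRAM")
```

With a server running, `[gpu_fraction(m) for m in fetch_running_models()]` gives the same ratio for each loaded model; a value near 0 suggests the model is running on the CPU. The `ollama ps` command reports similar information on the command line.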

@LarFii added the "good first issue" label on Nov 7, 2024
@LarFii pinned this issue on Nov 7, 2024

@aiproductguy
Contributor

I have been able to offload the insert process onto a free cloud service (streamlit.io) and provide insert, query, visualize, and download buttons in a LightRAG GUI.

This does not exactly speed up insert, but it does offload compute in case you are constrained by local device resources.

@chandasampath

Can we do parallel inserts into the RAG? Has anyone tried?

@davidleon
Contributor

Actually, take a look at the jina example; it inserts docs concurrently. However, I didn't investigate the entity race issue much. My suggestion is not to set the concurrency too high unless you know exactly what you are doing.
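
As an illustration of keeping the concurrency bounded, here is a minimal sketch using `asyncio.Semaphore`. It does not reproduce the jina example: `insert_all` and `fake_insert` are hypothetical names, and whether LightRAG's own insert path is safe under concurrent calls (the entity race mentioned above) is not verified here.

```python
import asyncio

async def insert_all(docs, insert_one, max_concurrency=2):
    """Run insert_one(doc) for every doc with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(doc):
        async with semaphore:
            return await insert_one(doc)

    return await asyncio.gather(*(guarded(doc) for doc in docs))

# Hypothetical stand-in for an async insert; it records the peak concurrency.
in_flight = 0
peak = 0

async def fake_insert(doc):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return len(doc)

results = asyncio.run(insert_all(["doc a", "doc bb", "doc ccc", "doc dddd"], fake_insert))
print(results, "peak concurrency:", peak)
```

In real use, `insert_one` would be whatever async insert callable the installed LightRAG version provides, assuming it has one; starting with a small `max_concurrency` keeps the load on the LLM backend predictable.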

@timelesshc

timelesshc commented Dec 30, 2024

I'm facing the same issue using API calls. It took around 3 minutes to process one text chunk, and each chunk has only 600 tokens. My documents have 150+ chunks, so the process is quite slow. Did you figure out how to speed up the insert process when using an API?
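
To put these numbers in perspective, a rough back-of-the-envelope estimate is sketched below. It is idealized: it assumes the per-chunk latency stays at 3 minutes regardless of how many requests run at once and ignores API rate limits, and it uses 150 chunks as a lower bound for "150+".

```python
import math

def estimate_minutes(num_chunks, minutes_per_chunk, concurrency=1):
    """Idealized wall time: chunks are processed in waves of `concurrency`."""
    waves = math.ceil(num_chunks / concurrency)
    return waves * minutes_per_chunk

sequential = estimate_minutes(150, 3)
with_four = estimate_minutes(150, 3, concurrency=4)
print(f"sequential: {sequential} min ({sequential / 60:.1f} h)")
print(f"4 concurrent: {with_four} min ({with_four / 60:.1f} h)")
```

Processed one at a time, 150 chunks at 3 minutes each take 450 minutes (7.5 hours); with four requests truly in flight the same idealized model gives 114 minutes.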

@PoyBoi

PoyBoi commented Jan 1, 2025

Funnily enough, I do have decent specs (an NVIDIA GPU), and I have torch running and it detects CUDA as well. But for some reason GPU acceleration doesn't work. One difference is that I am using HF models instead of Ollama, as you described.

I peeked at the source code, and it does seem to be moving the embedding model to CUDA, yet GPU usage still stays at 0%.

@taras-bl

taras-bl commented Jan 2, 2025

@PoyBoi

PoyBoi commented Jan 2, 2025

Found a fix for this:

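import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from lightrag import LightRAG
from lightrag.llm import hf_model_complete, hf_embedding
from lightrag.utils import EmbeddingFunc

# WORKING_DIR, llm_model and tokenizer_model are defined elsewhere in the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the embedding tokenizer and model once instead of on every embedding call.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
embed_model = AutoModel.from_pretrained(tokenizer_model).to(device)

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=lambda *args, **kwargs: hf_model_complete(
        *args, device=device, **kwargs
    ),
    llm_model_name=llm_model,
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=tokenizer,
            embed_model=embed_model,
        ),
    ),
    # Extra
    llm_model_kwargs={
        "quantization_config": quantization_config,
        "device_map": "auto",
    },
    addon_params={"insert_batch_size": 20},
    llm_model_max_async=30,
)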

Processing time went from 1 hr+ to ~12 minutes, with ~80% GPU usage and about 2 GB of VRAM (depending on the embedding model you choose for inference).
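
Figures like these can be checked while an insert is running by polling `nvidia-smi`. The sketch below is one possible way to do that, not part of the fix itself: it assumes an NVIDIA driver with `nvidia-smi` on the PATH, the helper names are hypothetical, and the sample row values are made up so the parser can be tried without a GPU.

```python
import subprocess

# Query index, utilization (%) and used memory (MiB) as bare CSV rows.
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into dicts with index, util (%) and memory (MiB)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats

def read_gpu_stats():
    """Run nvidia-smi and return parsed stats, or [] if the tool is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_stats(out)

# Hand-written sample output (made-up values).
sample = "0, 80, 2048\n1, 0, 12\n"
print(parse_gpu_stats(sample))
```

Calling `read_gpu_stats()` every few seconds during an insert shows whether utilization stays near 0%, as in the earlier report, or rises once the models are placed on the GPU. The parser expects numeric fields, so devices that report a value as not available would need extra handling.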

PS: don't forget to do the following.

Install:

pip install -U bitsandbytes

Import:

from transformers import BitsAndBytesConfig
