High Disk Usage for Simple PDF with Qdrant Collection #867

Open
111kannan opened this issue Dec 19, 2024 · 0 comments

While using Qdrant to manage document embeddings, I noticed unexpectedly high disk usage. After processing a very simple PDF file, the resulting collection occupies far more disk space than expected, even though its snapshot remains small.

Code Snippet
Below is the code used for creating and loading the collection:

Python

from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader
from qdrant_client.models import VectorParams, Distance
from qdrant_client import QdrantClient, models

collection_name = "xyx"
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L12-v2")
embedding_size = len(embeddings.embed_query("Sample text"))
upload_path = "path of pdf"
loader = PyMuPDFLoader(upload_path)

docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

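# The original snippet never defines qdrant_client; a local server URL is assumed here.
qdrant_client = QdrantClient(url="http://localhost:6333")

qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE, on_disk=True),
)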

vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=collection_name,
    embedding=embeddings,
)
vector_store.add_documents(documents=splits)

Issue
After processing and loading vectors for a very simple PDF document:

The on-disk size of the collection is unexpectedly high.
The snapshot size, however, is relatively small.
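
For reference, a rough lower bound on the space the vectors themselves need can be estimated as below. This is only a sketch: the chunk count of 50 is a made-up figure (in the snippet above it would be `len(splits)`), 384 is the output dimension of all-MiniLM-L12-v2 (what `embedding_size` should evaluate to), and `raw_vector_bytes` is a hypothetical helper, not part of any Qdrant API. It ignores payloads, indexes, and any files the storage engine allocates up front.

```python
def raw_vector_bytes(num_points: int, dim: int, bytes_per_component: int = 4) -> int:
    """Lower bound on vector storage: one float32 per component, nothing else."""
    return num_points * dim * bytes_per_component


# Hypothetical figures: 50 chunks from a short PDF, 384-dimensional embeddings.
estimate = raw_vector_bytes(num_points=50, dim=384)
print(f"{estimate} bytes (~{estimate / 1024:.0f} KiB)")  # 76800 bytes (~75 KiB)
```

Anything far above this figure on disk would be overhead beyond the raw vectors.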

Observed Behavior
The gap between the on-disk size and the snapshot size suggests inefficient storage utilization or substantial metadata overhead.
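
To make the on-disk figure reproducible, the collection directory can be measured in two ways: the apparent size (sum of file lengths) and the space actually allocated by the filesystem. The two can differ when files are sparse or preallocated. The sketch below uses only the standard library; `directory_sizes` is a hypothetical helper, and the storage path shown is an assumption that depends on how Qdrant is deployed.

```python
import os


def directory_sizes(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) summed over all files under root."""
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks counts 512-byte units on POSIX; fall back to st_size elsewhere.
            blocks = getattr(st, "st_blocks", None)
            allocated += blocks * 512 if blocks is not None else st.st_size
    return apparent, allocated


# Assumed location of the collection on disk; adjust to the actual storage path.
apparent, allocated = directory_sizes("./qdrant_storage/collections/xyx")
print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

Reporting both numbers alongside the snapshot size would show whether the difference comes from preallocated files or from data that is actually written.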

This behavior might impact scenarios where multiple small PDFs are processed, leading to disproportionately high disk usage.
