High Disk Usage for Simple PDF with Qdrant Collection #867

111kannan · 2024-12-19T02:56:14Z

While using Qdrant to manage document embeddings, I noticed unexpectedly high disk usage. After processing a very simple PDF file, the resulting collection occupies a disproportionately large amount of disk space, even though its snapshot remains small.

Code Snippet
Below is the code used for creating and loading the collection:

Python

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymupdf_loader import PyMuPDFLoader
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Assumption: the snippet does not show how the client was created;
# a local Qdrant server on the default port is assumed here.
qdrant_client = QdrantClient(url="http://localhost:6333")

collection_name = "xyx"
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L12-v2")
# Derive the vector dimension from a sample embedding
embedding_size = len(embeddings.embed_query("Sample text"))
upload_path = "path of pdf"
loader = PyMuPDFLoader(upload_path)

# Load the PDF and split it into overlapping chunks
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create the collection with vectors stored on disk
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE, on_disk=True),
)

# Embed the chunks and upload them to the collection
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=collection_name,
    embedding=embeddings,
)
vector_store.add_documents(documents=splits)

Issue
After processing and loading vectors for a very simple PDF document:

The on-disk size of the collection is unexpectedly high.
The snapshot size, however, is relatively small.
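
To put "unexpectedly high" into numbers, it may help to compare the on-disk size against a rough lower bound for the raw vectors alone. The sketch below assumes 4-byte (float32) vector components and uses a hypothetical chunk count of 50; the helper name `raw_vector_bytes` is made up for illustration and is not part of any Qdrant API. It ignores payloads, indexes, and any other files the storage engine keeps.

```python
def raw_vector_bytes(num_points: int, dim: int, bytes_per_component: int = 4) -> int:
    """Lower bound on storage for the dense vectors alone (no payload or index)."""
    return num_points * dim * bytes_per_component


# Hypothetical numbers: 50 chunks, 384 dimensions (output size of all-MiniLM-L12-v2).
estimate = raw_vector_bytes(num_points=50, dim=384)
print(f"{estimate} bytes (~{estimate / 1024:.0f} KiB)")  # 76800 bytes (~75 KiB)
```

With the snippet above, `len(splits)` and `embedding_size` could be passed in place of the hypothetical values.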

Observed Behavior
The discrepancy between the on-disk size and the snapshot size suggests inefficient storage utilization or significant metadata overhead.

This behavior might impact scenarios where multiple small PDFs are processed, leading to disproportionately high disk usage.
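
A minimal sketch for measuring the collection's footprint, so the on-disk and snapshot sizes can be reported as concrete numbers. It sums both the apparent file sizes and the space actually allocated by the filesystem, since the two can differ when files are sparse or preallocated. The storage path is an assumption and depends on how Qdrant is deployed; `directory_sizes` is a hypothetical helper, not a Qdrant function.

```python
import os
from pathlib import Path


def directory_sizes(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for all regular files under root."""
    apparent = 0
    allocated = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            st = path.stat()
            apparent += st.st_size
            # st_blocks is in 512-byte units on POSIX; fall back to st_size elsewhere
            blocks = getattr(st, "st_blocks", None)
            allocated += blocks * 512 if blocks is not None else st.st_size
    return apparent, allocated


# Assumed location of the collection's storage directory; adjust to your deployment.
storage_dir = "./qdrant_storage/collections/xyx"
if os.path.isdir(storage_dir):
    apparent, allocated = directory_sizes(storage_dir)
    print(f"apparent: {apparent / 2**20:.1f} MiB, allocated: {allocated / 2**20:.1f} MiB")
```

Running the same function on the directory holding the snapshot file gives a like-for-like comparison.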
