While using Qdrant to manage document embeddings, I noticed a problem with disk space utilization. After processing a very simple PDF file, the resulting collection occupies a disproportionately large amount of disk space, even though its snapshot remains small.
Code Snippet
Below is the code used for creating and loading the collection:
Python
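from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
# Import path as given in the report; LangChain itself provides
# PyMuPDFLoader in langchain_community.document_loaders.
from pymupdf_loader import PyMuPDFLoader
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Assumption: the report never shows how the client was created.
# The URL below is a placeholder for a local Qdrant instance.
qdrant_client = QdrantClient(url="http://localhost:6333")

collection_name = "xyx"
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L12-v2")
# Derive the vector dimension from the model instead of hard-coding it.
embedding_size = len(embeddings.embed_query("Sample text"))

# Load the PDF and split it into overlapping chunks.
upload_path = "path of pdf"
loader = PyMuPDFLoader(upload_path)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create the collection with vectors stored on disk.
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE, on_disk=True),
)

# Embed the chunks and upload them to the collection.
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=collection_name,
    embedding=embeddings,
)
vector_store.add_documents(documents=splits)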
Issue
After processing and loading vectors for a very simple PDF document:
The on-disk size of the collection is unexpectedly high.
The snapshot size, however, is relatively small.
Observed Behavior
The gap between the on-disk size and the snapshot size suggests either inefficient storage utilization or significant metadata overhead.
This behavior might impact scenarios where multiple small PDFs are processed, leading to disproportionately high disk usage.
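To put numbers on the discrepancy described above, a small script along these lines could help. It is only a sketch, not part of the original report: `directory_usage` and `largest_files` are hypothetical helper names. It walks a directory and reports both the apparent size (sum of file lengths) and the space actually allocated on disk, then lists the largest files so it is visible where the space goes.

```python
import os


def directory_usage(root):
    """Return (apparent_bytes, allocated_bytes) for all files under root."""
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks is in 512-byte units on POSIX systems; where it is
            # unavailable (e.g. Windows), fall back to the apparent size.
            blocks = getattr(st, "st_blocks", None)
            allocated += blocks * 512 if blocks is not None else st.st_size
    return apparent, allocated


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, relative_path) pairs, largest first."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.lstat(full).st_size, os.path.relpath(full, root)))
    entries.sort(reverse=True)
    return entries[:limit]
```

Assuming the storage directory of the Qdrant instance is reachable from the script, calling `directory_usage` on the folder of the collection `xyx` and comparing the result with the size of the snapshot file would show how large the gap really is. The exact location of that folder depends on how Qdrant is deployed, so it is left as a parameter here. A large difference between the apparent and allocated figures, or a few very large files in the `largest_files` output, would point toward preallocated or sparse files rather than the vectors themselves.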