Update tiledb.py vectorstore #105

Open: BBC-Esq wants to merge 4 commits into base: main


@BBC-Esq BBC-Esq commented Jun 11, 2025

Enable 8-bit Vector Types & Extra Distance Metrics in langchain_community/vectorstores/tiledb.py

Background

TileDB-Vector-Search already supports

  • 8-bit vector storage (TILEDB_INT8, TILEDB_UINT8)
  • Distance metrics: L2 (Euclidean), squared-L2 (sum-of-squares) and Cosine (TileDB transparently normalises vectors for cosine)
  • INT8 indices since the May-2024 release

The upstream LangChain wrapper always cast embeddings to float32 and exposed only "euclidean".
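
For reviewers less familiar with the three metrics named above, here is a minimal NumPy sketch of how they relate. It is illustrative only: the function names are hypothetical, it does not call TileDB, and the `1 - cos` convention for cosine distance is an assumption rather than something this PR confirms about TileDB's internals.

```python
import numpy as np


def l2(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance: square root of the sum of squared differences."""
    return float(np.linalg.norm(a - b))


def squared_l2(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of squared differences; same ranking as L2, no square root."""
    d = a - b
    return float(np.dot(d, d))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity, computed on unit-normalised copies (assumed convention)."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)
    return float(1.0 - np.dot(a_unit, b_unit))


a = np.array([3.0, 4.0], dtype=np.float32)
b = np.array([6.0, 8.0], dtype=np.float32)  # same direction as a, twice the length

d_l2 = l2(a, b)
d_sq = squared_l2(a, b)
d_cos = cosine_distance(a, b)
```

Because `b` points in the same direction as `a`, the cosine distance is (numerically) zero even though the L2 distance is not, which is why the choice of metric matters when embeddings are not unit length.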

What this PR adds

| Area | Change |
| --- | --- |
| Metric support | INDEX_METRICS now allows "euclidean", "squared_l2" and "cosine", mapped to vspy.DistanceMetric. |
| Dtype handling | Hard-coded astype(np.float32) casts removed. Wrapper accepts np.float32, np.int8, np.uint8. Half-precision inputs (float16, bfloat16) are auto-upcast to float32 for storage. |
| Cosine workflow | Normalisation is left to TileDB’s internal routines; the wrapper performs no ingest-time or query-time normalisation (except on a local copy for MMR post-processing). |
| Index creation | TileDB.create() forwards the chosen dtype and metric to flat_index / ivf_flat_index. |
| Query helper | New _prepare_query_vector() guarantees correct shape/dtype and upcasts half-precision if needed. |
| Ingestion paths | from_texts(), from_embeddings() and add_texts() honour an optional vector_dtype parameter and keep the selected dtype end-to-end. |
| Validation | Clear ValueError for unsupported metric/dtype; float16/bfloat16 guard for older NumPy; pickle-safety flag retained. |
| Backward compatibility | Default (float32, "euclidean") behaviour unchanged; existing code runs without modification. |
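
To make the validation and query-helper rows above concrete, here is a rough sketch of the kind of logic they describe. It is not the code in this PR: `validate_options` and `prepare_query_vector` are hypothetical names, `INDEX_METRICS` is modelled as a plain set rather than the mapping to vspy.DistanceMetric, and bfloat16 is left out because NumPy has no native bfloat16 dtype.

```python
import numpy as np

# Stand-in for the wrapper's metric table (assumed shape; the PR maps to vspy.DistanceMetric).
INDEX_METRICS = frozenset({"euclidean", "squared_l2", "cosine"})
STORAGE_DTYPES = (np.dtype(np.float32), np.dtype(np.int8), np.dtype(np.uint8))


def validate_options(metric: str, vector_dtype) -> np.dtype:
    """Return the storage dtype, or raise ValueError for unsupported input."""
    if metric not in INDEX_METRICS:
        raise ValueError(
            f"Unsupported metric {metric!r}; expected one of {sorted(INDEX_METRICS)}"
        )
    dtype = np.dtype(vector_dtype)
    if dtype == np.dtype(np.float16):
        # Half precision is upcast to float32 for storage.
        return np.dtype(np.float32)
    if dtype not in STORAGE_DTYPES:
        raise ValueError(f"Unsupported vector dtype {dtype}")
    return dtype


def prepare_query_vector(vector, storage_dtype) -> np.ndarray:
    """Coerce a query to a contiguous (1, dim) array in the index's dtype."""
    arr = np.asarray(vector)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)
    elif arr.ndim != 2:
        raise ValueError(f"Expected a 1-D or 2-D query, got {arr.ndim} dimensions")
    return np.ascontiguousarray(arr, dtype=storage_dtype)


storage = validate_options("cosine", np.float16)
query = prepare_query_vector(np.array([0.1, 0.2, 0.3], dtype=np.float16), storage)
```

Here a float16 query against a "cosine" index comes back as a float32 array of shape (1, 3), matching the upcast behaviour the table describes.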

Usage Examples

import numpy as np
from langchain_community.vectorstores import TileDB
from langchain_community.embeddings import SentenceTransformerEmbeddings

emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
texts = ["Vector search is fast.", "Cosine similarity loves unit vectors!"]

# 1 – INT8 IVF_FLAT index with Cosine distance
db = TileDB.from_texts(
    texts,
    emb,
    metric="cosine",
    vector_dtype=np.int8,
    index_type="IVF_FLAT",
    index_uri="/tmp/tiledb_int8_cosine",
)

docs = db.similarity_search("speedy vector search", k=2)

# 2 – UINT8 FLAT index with squared-L2
pairs = list(zip(texts, emb.embed_documents(texts)))
db2 = TileDB.from_embeddings(
    pairs,
    emb,
    metric="squared_l2",
    vector_dtype=np.uint8,
    index_type="FLAT",
    index_uri="/tmp/tiledb_uint8_sumsq",
)

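# 3 – Load existing index and query
# (depending on the LangChain version, the pickle-safety flag may need to be
# passed explicitly, e.g. allow_dangerous_deserialization=True)
db3 = TileDB.load("/tmp/tiledb_uint8_sumsq", emb, metric="squared_l2")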
print(db3.similarity_search_with_score("vector maths", k=1))
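
One point worth checking in review: the examples above request int8/uint8 storage for a float embedding model, and the description does not say whether the wrapper rescales values before casting. A plain cast of roughly unit-norm floats to int8 truncates almost everything to zero. The sketch below shows caller-side symmetric scalar quantisation as one possible approach; `quantize_int8` is a hypothetical helper, not part of this PR or of TileDB.

```python
import numpy as np


def quantize_int8(vectors, scale=None):
    """Map float vectors onto [-127, 127] with one shared scale factor."""
    arr = np.asarray(vectors, dtype=np.float32)
    if scale is None:
        max_abs = float(np.max(np.abs(arr)))
        scale = 127.0 / max_abs if max_abs > 0 else 1.0
    # Round to the nearest integer, then clip to guard against float drift.
    quantized = np.clip(np.rint(arr * scale), -127, 127).astype(np.int8)
    return quantized, scale


embeddings = np.array([[0.6, -0.8], [0.1, 0.99]], dtype=np.float32)

naive = embeddings.astype(np.int8)           # plain cast truncates toward zero
quantized, scale = quantize_int8(embeddings)  # rescaled before casting

# Queries need the same scale as the stored vectors to stay comparable.
query_q, _ = quantize_int8(np.array([[0.5, -0.5]], dtype=np.float32), scale=scale)
```

A single shared scale multiplies every L2 distance by the same factor and leaves cosine similarity unchanged up to rounding error, so rankings are approximately preserved.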

BBC-Esq commented Jun 11, 2025

@tomaarsen perhaps you could review as well, since you're familiar with the sentence transformers side of things?

BBC-Esq commented Jun 11, 2025

@ihnorton I forgot to mention that it would be helpful if you could review as well, since you're familiar with the TileDB vector search side.

BBC-Esq commented Jun 15, 2025

Can this get a review please?
