-
Notifications
You must be signed in to change notification settings - Fork 1.3k
CSHARP-5763: Auto-embedding for vector search #1842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Note that I have not been able to test this against a server that fully supports auto-embedding. The version I have access to throws for a variety of cases, has no available embedding models, and index creation never completes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for auto-embedding in vector search indexes, allowing MongoDB Atlas to automatically create vector embeddings from text fields using specified embedding models (e.g., "voyage-4"). The implementation includes new API constructors, query vector handling for text input, and comprehensive test coverage.
Key changes:
- New
VectorEmbeddingModalityenum to specify the type of data being embedded QueryVectornow supports text strings for auto-embedding indexesCreateVectorSearchIndexModelhas new constructors for auto-embedding indexes with support for compression profiles and explicit compression settings
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| VectorEmbeddingModality.cs | New enum defining modality types for auto-embedding (currently only Text) |
| QueryVector.cs | Added string constructor and implicit operator for text-based queries |
| PipelineStageDefinitionBuilder.cs | Modified to use "query" field for text vs "queryVector" for arrays |
| CreateVectorSearchIndexModel.cs | Added auto-embedding constructors, compression profile support, and updated rendering logic |
| VectorSearchTests.cs | Added test for auto-embedding vector search with new Movie class |
| AtlasSearchIndexManagmentTests.cs | Added comprehensive tests for auto-embedding index creation with various options and renamed existing tests for consistency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tests/MongoDB.Driver.Tests/Search/AtlasSearchIndexManagmentTests.cs
Outdated
Show resolved
Hide resolved
tests/MongoDB.Driver.Tests/Search/AtlasSearchIndexManagmentTests.cs
Outdated
Show resolved
Hide resolved
| /// indexes, this is only used when specifying explicit field compression using <see cref="Quantization"/>. | ||
| /// </summary> | ||
| public int Dimensions { get; } | ||
| public int Dimensions { get; init; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these fields be called CompressionDimensions and CompressionQuantization similar to CompressionProfileName ?
I think we need to make it more clear that this translates to compression and compression.quantization fields.
Another thought: would wrapping this in Compression object with
CompressionProfile : Compression
{
string ProfileName;
}
CompressionQuantization : Compression
{
int Dimensions;
Quantization Quantization
}
be an overkill ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll follow up on this. I initially created these as separate things in the API, but then I realized when documenting them that they were the same, except one said "for auto-embedding" and the other "not for auto-embedding." As far as I can tell, Quantization and Dimensions have the same meaning either way, so duplicating these properties didn't seem helpful.
I can add to the documentation what structure they end up in the BSON in each case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this code for now, since compression is not included in this release.
|
|
||
| /// <summary> | ||
| /// Type of automatic vector quantization for your vectors. | ||
| /// Type of automatic vector quantization for your vectors. At most one of <see cref="Quantization"/> and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope that VectorQuantization is reusable across fields.quantization and fields.compression.quantization fields.
At this point looks like we need to add binary_rescore value to VectorQuantization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this isn't clear, I will follow up on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not doing this for now, since current release doesn't include compression. We can re-visit once that part is finalized.
| else | ||
| { | ||
| vectorField.Add("quantization", BsonString.Create(Quantization.ToString()?.ToLower())); | ||
| vectorField.Add("type", "autoEmbed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we are hiding type, do we want to consider having
CreateVectorSearchIndexModel
and
CreateVectorSearchAutoEmbedIndexModel
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I'm going to assume all binary breaking changes are off the table, since otherwise this API would not be the compromise it is in the first place. (If binary changes are allowed then a) we should re-do the part of this API that is already checked in to clean it up, and b_) we should probably introduce a new subclass for vector indexes, which can have one or two subclasses for the kinds of vector indexes.)
Given that, we can create a new type which means that:
- Much of the existing configuration for vector indexes will be duplicated.
- When we add something new to vector indexes in the future, then it will likely need to be duplicated as well.
Or we can keep the same time which means that:
- Some property combinations will (currently) not be valid.
- It's potentially unclear which kind of vector index you are creating.
I opted for the second choice since, on balance, it seems a better long-term choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we didn't release typed index builders (CSHARP-5717) yet. So we got some flexibility here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replied to wrong comment yesterday: Pushed a new design with subclasses for different vector index types. We will have to pull members down into the base type as they are also supported in auto-embed indexes.
src/MongoDB.Driver/QueryVector.cs
Outdated
| { | ||
| Ensure.IsNotNull(bsonText, nameof(bsonText)); | ||
|
|
||
| Vector = bsonText; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can cover these changes in QueryVectorTests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| var vectorSearchOperator = new BsonDocument | ||
| { | ||
| { "queryVector", queryVector.Vector }, | ||
| { queryData is BsonString ? "query" : "queryVector", queryData }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through the spec, looks like model is not strictly required as of now, but might be in the future. Do we want to add it later or now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem useful to me--it becomes a "quiz API". That is, I know that you need to pass this as the model, because that is how you created the index, but I'm going to ask you to tell me anyway, and if you get it right, I will proceed, while if you get it wrong, I will stop. That is generally not a useful thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed up privately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added model to options.
| else | ||
| { | ||
| vectorField.Add("quantization", BsonString.Create(Quantization.ToString()?.ToLower())); | ||
| vectorField.Add("type", "autoEmbed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we didn't release typed index builders (CSHARP-5717) yet. So we got some flexibility here.
| var vectorSearchOperator = new BsonDocument | ||
| { | ||
| { "queryVector", queryVector.Vector }, | ||
| { queryData is BsonString ? "query" : "queryVector", queryData }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed up privately.
…elease. - Changing QueryVector to take string instead of BsonString - Added model to vector query options - Fixed lookup for index status on community server - Added more query tests - Changed query tests to run on small set of documents that can be queried in a reasonable amount of time.
BorisDog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments
tests/MongoDB.Driver.Tests/Search/AtlasSearchIndexManagmentTests.cs
Outdated
Show resolved
Hide resolved
BorisDog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some failing unittests.
| } | ||
|
|
||
| /// <inheritdoc/> | ||
| internal override BsonDocument Render(RenderArgs<TDocument> renderArgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion for CreateSearchIndexModel:
Should we introduce
BsonDocument Render(RenderArgs<TDocument> renderArgs) => Definition
for more uniform handling in CreateCreateIndexesOperation?
Also now that Render is internal, the exception in CreateSearchIndexModel.Definition is not accurate. Is it possible to override that property and throw there? I think making a property is acceptable as breaking change?
| ]); | ||
|
|
||
| _collection.SearchIndexes.CreateOne(new CreateAutoEmbeddingVectorSearchIndexModel<Movie>( | ||
| e => e.Plot, _autoEmbedIndexName, "voyage-4", filterFields: [e => e.Runtime, e => e.Year])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we introduce a temporary expectation for a failure here?
Note that I have not been able to test this against a server that fully supports auto-embedding. The version I have access to throws for a variety of cases, has no available embedding models, and index creation never completes.