Skip to content

feat: Adding multi modal support for PGVectorStore #207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

dishaprakash
Copy link
Collaborator

feat: Adding multi modal support for PGVectorStore

  1. Add new image search APIs similarity_search_image() and asimilarity_search_image().
  2. Add add_images() and aadd_images() endpoints to add images to vector store.
  3. Add tests.

gcs_uri = re.match("gs://(.*?)/(.*)", uri)
if gcs_uri:
bucket_name, object_name = gcs_uri.groups()
storage_client = storage.Client()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to wrap this in a try except block to provide a more clear error or do you think the error is clear if they are not running in a Google Cloud environment or have set up credentials.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other langchain packages don't have running integrations tests for 3P providers. We could mock this test or just test this functionality in our package downstream.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this is the error
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.
I think this is pretty descriptive, let me know what you think.

The options for the tests are:

  1. If we want we can try making the images publicly accessible on a GCP project (which claims that we would not need credentials to fetch it).
  2. We could also store the image directly and skip testing the pathway of GCP.
  3. Not test the add_images at all.

What do you suggest?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If GCS storage call can be easily mock, let's go ahead and do that. If it can't let's keep the test but skip it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the mock solutions may need more debugging, I've removed the gcs uri from being tested, the other images being created locally are still under the test.
I will recreate the GCS path testing in our libraries.

@dishaprakash dishaprakash marked this pull request as ready for review May 9, 2025 16:43
@@ -365,6 +369,92 @@ async def aadd_documents(
ids = await self.aadd_texts(texts, metadatas=metadatas, ids=ids, **kwargs)
return ids

def _encode_image(self, uri: str) -> str:
"""Get base64 string from a image URI."""
gcs_uri = re.match("gs://(.*?)/(.*)", uri)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unusual to parse a URI pointing to a bucket with a regular expression.

Could you instead confirm that the prefix is gs:// then identify the bucket name and object name etc.


web_uri = re.match(r"^(https?://).*", uri)
if web_uri:
response = requests.get(uri, stream=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an SSRF attack


async def aadd_images(
self,
uris: list[str],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting URIs without safe guards is an SSRF attack

import uuid
from typing import Any, Callable, Iterable, Optional, Sequence

import numpy as np
import requests
from google.cloud import storage # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The dependency should be optional not required. So the import cannot appear in the global namespace.
  2. The actual implementation should be against a key-value store interface not specifically google cloud storage. You can use the LangChain key-value store abstraction to support cloud storage.

blob = bucket.blob(object_name)
return base64.b64encode(blob.download_as_bytes()).decode("utf-8")

web_uri = re.match(r"^(https?://).*", uri)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regular expressions are not a good solution here. Simple prefix matching is more robust and probably will also be faster in terms of actual run time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants