-
Notifications
You must be signed in to change notification settings - Fork 78
feat: Adding multi modal support for PGVectorStore #207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
gcs_uri = re.match("gs://(.*?)/(.*)", uri) | ||
if gcs_uri: | ||
bucket_name, object_name = gcs_uri.groups() | ||
storage_client = storage.Client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to wrap this in a try except block to provide a more clear error or do you think the error is clear if they are not running in a Google Cloud environment or have set up credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other langchain packages don't have running integrations tests for 3P providers. We could mock this test or just test this functionality in our package downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this is the error
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.
I think this is pretty descriptive, let me know what you think.
The options for the tests are:
- If we want we can try making the images publicly accessible on a GCP project (which claims that we would not need credentials to fetch it).
- We could also store the image directly and skip testing the pathway of GCP.
- Not test the add_images at all.
What do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If GCS storage call can be easily mock, let's go ahead and do that. If it can't let's keep the test but skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the mock solutions may need more debugging, I've removed the gcs uri from being tested, the other images being created locally are still under the test.
I will recreate the GCS path testing in our libraries.
@@ -365,6 +369,92 @@ async def aadd_documents( | |||
ids = await self.aadd_texts(texts, metadatas=metadatas, ids=ids, **kwargs) | |||
return ids | |||
|
|||
def _encode_image(self, uri: str) -> str: | |||
"""Get base64 string from a image URI.""" | |||
gcs_uri = re.match("gs://(.*?)/(.*)", uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unusual to parse a URI pointing to a bucket with a regular expression.
Could you instead confirm that the prefix is gs:// then identify the bucket name and object name etc.
|
||
web_uri = re.match(r"^(https?://).*", uri) | ||
if web_uri: | ||
response = requests.get(uri, stream=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an SSRF attack
|
||
async def aadd_images( | ||
self, | ||
uris: list[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accepting URIs without safe guards is an SSRF attack
import uuid | ||
from typing import Any, Callable, Iterable, Optional, Sequence | ||
|
||
import numpy as np | ||
import requests | ||
from google.cloud import storage # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The dependency should be optional not required. So the import cannot appear in the global namespace.
- The actual implementation should be against a key-value store interface not specifically google cloud storage. You can use the LangChain key-value store abstraction to support cloud storage.
blob = bucket.blob(object_name) | ||
return base64.b64encode(blob.download_as_bytes()).decode("utf-8") | ||
|
||
web_uri = re.match(r"^(https?://).*", uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regular expressions are not a good solution here. Simple prefix matching is more robust and probably will also be faster in terms of actual run time
feat: Adding multi modal support for PGVectorStore
similarity_search_image()
andasimilarity_search_image()
.add_images()
andaadd_images()
endpoints to add images to vector store.