Commit 6174cbe

Update document loader proposal based on feedback
Specifically, added the following:
* Combine the two document loaders into a single AzureBlobStorageLoader class.
* Encourage the blob_parser parameter more as a future consideration. This would be helpful if a customer did not necessarily want to use the blob loader interfaces but still wanted more control over how to parse blobs.
1 parent 762a619 commit 6174cbe

1 file changed

+81
-58
lines changed

libs/azure-storage/proposals/document_loaders.md

Lines changed: 81 additions & 58 deletions
Original file line number | Diff line number | Diff line change
@@ -186,47 +186,29 @@ Below is the proposed specification for the Azure Blob Storage document loaders.
186186
All Azure Storage document loaders will live in the [`langchain_azure_storage` package][langchain-azure-storage-pkg]
187187
under a new `document_loaders` module.
188188

189-
There will be two document loaders introduced:
189+
There will be a single document loader introduced, `AzureBlobStorageLoader`. This single loader will encompass
190+
functionality from both the community-sourced `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
191+
document loaders.
190192

191-
* `AzureBlobStorageFileLoader` - Loads a `Document` from a single blob in Azure Blob Storage.
192-
* `AzureBlobStorageContainerLoader` - Loads `Document` objects from all blobs in a given container in Azure Blob Storage.
193-
Assuming no chunking is happening, each `Document` loaded will correspond 1:1 with a blob in the container.
194-
195-
Each document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
193+
The document loader will subclass from [`BaseLoader`][langchain-document-loader-base-ref] and support both synchronous
196194
and asynchronous loading of documents, as well as lazy loading of documents.
197195

198-
Below show the proposed constructor signatures for each document loader:
196+
Below shows the proposed constructor signature for the document loader:
199197

200198
```python
201-
from typing import Optional, Union, Callable
199+
from typing import Optional, Union, Callable, Iterable
202200
import azure.core.credentials
203201
import azure.core.credentials_async
204202
from langchain_core.document_loaders import BaseLoader
205203

206204

207-
class AzureBlobStorageFileLoader(BaseLoader):
208-
def __init__(self,
209-
account_url: str,
210-
container_name: str,
211-
blob_name: str,
212-
*,
213-
credential: Optional[
214-
Union[
215-
azure.core.credentials.AzureSasCredential,
216-
azure.core.credentials.TokenCredential,
217-
azure.core.credentials_async.AsyncTokenCredential,
218-
]
219-
] = None,
220-
loader_factory: Optional[Callable[str, BaseLoader]] = None,
221-
): ...
222-
223-
224-
class AzureBlobStorageContainerLoader(BaseLoader):
205+
class AzureBlobStorageLoader(BaseLoader):
225206
def __init__(self,
226207
account_url: str,
227208
container_name: str,
209+
blob_names: Optional[Union[str, Iterable[str]]] = None,
228210
*,
229-
prefix: str = "",
211+
prefix: Optional[str] = None,
230212
credential: Optional[
231213
Union[
232214
azure.core.credentials.AzureSasCredential,
@@ -241,13 +223,16 @@ class AzureBlobStorageContainerLoader(BaseLoader):
241223
In terms of parameters supported:
242224
* `account_url` - The URL to the storage account (e.g., `https://<account>.blob.core.windows.net`)
243225
* `container_name` - The name of the container within the storage account
244-
* `blob_name` - (File loader only) The name of the blob within the container to load.
226+
* `blob_names` - The name of the blob(s) within the container to load. If provided, only the specified blob(s)
227+
in the container will be loaded. If not provided, the loader will list blobs from the container to load, which
228+
will be all blobs unless `prefix` is specified.
245229
* `credential` - The credential object to use for authentication. If not provided,
246230
the loader will use [Azure default credentials][azure-default-credentials]. The
247231
`credential` field only supports token-based credentials and SAS credentials. It does
248232
not support access key based credentials nor anonymous access.
249-
* `prefix` - (Container loader only) An optional prefix to filter blobs within the container.
250-
Only blobs whose names start with the specified prefix will be loaded.
233+
* `prefix` - An optional prefix to filter blobs when listing from the container. Only blobs whose names start with the
234+
specified prefix will be loaded. This parameter is incompatible with `blob_names` and will raise a `ValueError` if both
235+
are provided.
251236
* `loader_factory` - A callable that returns a custom document loader (e.g., `UnstructuredLoader`) to use
252237
for parsing blobs downloaded. When provided, the Azure Storage document loader will download each blob to
253238
a temporary local file and then call `loader_factory` with the path to the temporary file to get a document
@@ -263,30 +248,44 @@ Below are some example usage patterns for the Azure Blob Storage document loader
263248
Below shows how to load a document from a single blob in Azure Blob Storage:
264249

265250
```python
266-
from langchain_azure_storage.document_loaders import AzureBlobStorageFileLoader
251+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
267252

268-
loader = AzureBlobStorageFileLoader("https://<account>.blob.core.windows.net", "<container>", "<blob>")
253+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>", "<blob>")
269254
for doc in loader.lazy_load():
270255
print(doc.page_content) # Prints content of blob. There should only be one document loaded.
271256
```
272257

258+
#### Load from a list of blobs
259+
Below shows how to load documents from a list of blobs in Azure Blob Storage:
260+
261+
```python
262+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
263+
264+
loader = AzureBlobStorageLoader(
265+
"https://<account>.blob.core.windows.net",
266+
"<container>",
267+
["blob1", "blob2", "blob3"]
268+
)
269+
```
270+
273271
#### Load from a container
274272

275273
Below shows how to load documents from all blobs in a given container in Azure Blob Storage:
276274

277275
```python
278-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
276+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
279277

280-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
278+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
281279
for doc in loader.lazy_load():
282280
print(doc.page_content) # Prints content of each blob in the container.
283281
```
284282

285283
Below shows how to load documents from blobs in a container with a given prefix:
286284

287285
```python
288-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
289-
loader = AzureBlobStorageContainerLoader(
286+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
287+
288+
loader = AzureBlobStorageLoader(
290289
"https://<account>.blob.core.windows.net", "<container>", prefix="some/prefix/"
291290
)
292291
for doc in loader.lazy_load():
@@ -297,11 +296,11 @@ for doc in loader.lazy_load():
297296
Below shows how to load documents asynchronously. This is achieved by calling the `aload()` or `alazy_load()` methods on the document loader. For example:
298297

299298
```python
300-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
299+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
301300

302301

303302
async def main():
304-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
303+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
305304
async for doc in loader.alazy_load():
306305
print(doc.page_content) # Prints content of each blob in the container.
307306
```
@@ -312,9 +311,10 @@ Below shows how to override the default credentials used by the document loader:
312311
```python
313312
from azure.core.credentials import AzureSasCredential
314313
from azure.identity import ManagedIdentityCredential
314+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
315315

316316
# Override with SAS token
317-
loader = AzureBlobStorageContainerLoader(
317+
loader = AzureBlobStorageLoader(
318318
"https://<account>.blob.core.windows.net",
319319
"<container>",
320320
credential=AzureSasCredential("<sas-token>")
@@ -323,7 +323,7 @@ loader = AzureBlobStorageContainerLoader(
323323

324324
# Override with more specific token credential than the entire
325325
# default credential chain (e.g., system-assigned managed identity)
326-
loader = AzureBlobStorageContainerLoader(
326+
loader = AzureBlobStorageLoader(
327327
"https://<account>.blob.core.windows.net",
328328
"<container>",
329329
credential=ManagedIdentityCredential()
@@ -338,10 +338,10 @@ the `UnstructuredLoader` to parse the local file and return `Document` objects
338338
on behalf of the Azure Storage document loader:
339339

340340
```python
341-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
341+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
342342
from langchain_unstructured import UnstructuredLoader
343343

344-
loader = AzureBlobStorageContainerLoader(
344+
loader = AzureBlobStorageLoader(
345345
"https://<account>.blob.core.windows.net",
346346
"<container>",
347347
# The UnstructuredLoader class accepts a string to the local file path to its constructor,
@@ -358,7 +358,7 @@ If a customer wants to provide additional configuration to the document loader,
358358
define a callable that returns an instantiated document loader. For example, to provide
359359
custom configuration to the `UnstructuredLoader`:
360360
```python
361-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
361+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
362362
from langchain_unstructured import UnstructuredLoader
363363

364364

@@ -370,7 +370,7 @@ def loader_factory(file_path: str) -> UnstructuredLoader:
370370
)
371371

372372

373-
loader = AzureBlobStorageContainerLoader(
373+
loader = AzureBlobStorageLoader(
374374
"https://<account>.blob.core.windows.net", "<container>",
375375
loader_factory=loader_factory
376376
)
@@ -385,11 +385,13 @@ customers will need to perform the following changes:
385385
1. Depend on the `langchain-azure-storage` package instead of `langchain-community`.
386386
2. Update import statements from `langchain_community.document_loaders` to
387387
`langchain_azure_storage.document_loaders`.
388-
3. Update document loader constructor calls to:
388+
3. Change class names from `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`
389+
to `AzureBlobStorageLoader`.
390+
4. Update document loader constructor calls to:
389391
1. Use an account URL instead of a connection string.
390392
2. Specify `UnstructuredLoader` as the `loader_factory` if they continue to want to use
391393
Unstructured for parsing documents.
392-
4. Ensure the environment has proper credentials (e.g., running the `az login` command, setting up
394+
5. Ensure the environment has proper credentials (e.g., running the `az login` command, setting up
393395
managed identity, etc.) as the connection string would have previously contained the credentials.
394396

395397
Below shows code snippets of what usage patterns look like before and after the proposed migration:
@@ -414,16 +416,16 @@ file_loader = AzureBlobStorageFileLoader(
414416
**After migration:**
415417

416418
```python
417-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader, AzureBlobStorageFileLoader
419+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
418420
from langchain_unstructured import UnstructuredLoader
419421

420-
container_loader = AzureBlobStorageContainerLoader(
422+
container_loader = AzureBlobStorageLoader(
421423
"https://<account>.blob.core.windows.net",
422424
"<container>",
423425
loader_factory=UnstructuredLoader
424426
)
425427

426-
file_loader = AzureBlobStorageFileLoader(
428+
file_loader = AzureBlobStorageLoader(
427429
"https://<account>.blob.core.windows.net",
428430
"<container>",
429431
"<blob>",
@@ -464,16 +466,16 @@ When a `credential` is provided, the credential will be:
464466
```python
465467
import azure.identity
466468
import azure.identity.aio
467-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
469+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
468470

469-
sync_doc_loader = AzureBlobStorageContainerLoader(
471+
sync_doc_loader = AzureBlobStorageLoader(
470472
"https://<account>.blob.core.windows.net",
471473
"<container>",
472474
credential=azure.identity.ManagedIdentityCredential()
473475
)
474476
sync_doc_loader.aload() # Raises ValueError because a sync credential was provided
475477

476-
async_doc_loader = AzureBlobStorageContainerLoader(
478+
async_doc_loader = AzureBlobStorageLoader(
477479
"https://<account>.blob.core.windows.net",
478480
"<container>",
479481
credential=azure.identity.aio.ManagedIdentityCredential()
@@ -490,9 +492,9 @@ When a `credential` is provided, the credential will be:
490492
By default, the document loaders will populate the `source` metadata field of each `Document`
491493
object with the URL of the blob (e.g., `https://<account>.blob.core.windows.net/<container>/<blob>`). For example:
492494
```python
493-
from langchain_azure_storage.document_loaders import AzureBlobStorageContainerLoader
495+
from langchain_azure_storage.document_loaders import AzureBlobStorageLoader
494496

495-
loader = AzureBlobStorageContainerLoader("https://<account>.blob.core.windows.net", "<container>")
497+
loader = AzureBlobStorageLoader("https://<account>.blob.core.windows.net", "<container>")
496498
for doc in loader.lazy_load():
497499
print(doc.metadata["source"]) # Prints URL of each blob in the container.
498500
```
@@ -515,7 +517,7 @@ from langchain_core.documents import Document
515517
from typing import Iterator
516518

517519

518-
class AzureBlobStorageContainerLoader(BaseLoader):
520+
class AzureBlobStorageLoader(BaseLoader):
519521
...
520522
def _lazy_load_from_custom_loader(self, blob_name: str) -> Iterator[Document]:
521523
with tempfile.NamedTemporaryFile() as temp_file:
@@ -573,6 +575,27 @@ However, similar to why document loaders were chosen over blob loaders, blob par
573575
over libraries like Unstructured and take away from the batteries-included value proposition that LangChain document
574576
loaders provide.
575577

578+
It's important to note that this decision does not prevent us from exposing a `blob_parser` parameter in the future.
579+
Specifically, this would be useful if we see customers wanting to customize loading behavior more but not necessarily
580+
wanting to drop down to using a blob loader interface.
581+
582+
583+
#### Exposing document loaders as two classes, `AzureBlobStorageFileLoader` and `AzureBlobStorageContainerLoader`, instead of a single `AzureBlobStorageLoader`
584+
Exposing the document loaders as these two classes would be beneficial in that they would match the existing community
585+
document loaders and lessen the amount of changes needed to migrate. However, combining them into a single class
586+
has the following advantages:
587+
588+
* It simplifies the getting started experience. Customers will no longer have to make a decision on which Azure Storage
589+
document loader class to use as there will be only one document loader class to choose from.
590+
* It simplifies class names by removing the additional `File` and `Container` qualifiers, which could lead to
591+
misinterpretations on what the classes do.
592+
* It is easier to maintain as there is only one class that will need to be maintained and less code will likely need to
593+
be duplicated.
594+
595+
While this will introduce an additional step in migrating (i.e., change class names), the impact is limited
596+
as customers will still be providing the same positional parameters even after changing class names
597+
(i.e., use account + container for the container loader and account + container + blob for the file loader).
598+
576599

577600
#### Alternatives to default parsing to UTF-8 text
578601
The default parsing logic when no `loader_factory` is provided is to treat the blob content as UTF-8 text
@@ -638,10 +661,10 @@ customize how blobs are parsed to text. However, possible requested extension po
638661
* Wanting the blob data to be passed using an in-memory representation rather than a file on disk
639662

640663
If we ever plan to extend the interface, we should strongly consider exposing blob loaders
641-
instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
664+
and/or a `blob_parser` parameter instead as discussed in the [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory)
642665
section above.
643666

644-
If blob loaders do not suffice, we could consider expanding the `loader_factory` to:
667+
If neither blob loaders nor a `blob_parser` parameter suffice, we could consider expanding the `loader_factory` to:
645668

646669
* Inspect signature arguments of callable provided to `loader_factory` and call the callable with
647670
additional parameters if detected (e.g., detect if a `blob_properties` parameter is present and
@@ -666,7 +689,7 @@ Based on customer requests, in the future, we could consider exposing these prop
666689
## Future work
667690
Below are some possible future work ideas that could be considered after the initial implementation based on customer feedback:
668691

669-
* Expose blob loader integrations for Azure Blob Storage (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
692+
* Expose blob loader and/or blob parser integrations (see [alternatives considered](#exposing-a-blob_parser-parameter-instead-of-loader_factory) section).
670693
* Proxy additional blob properties as document metadata (see [FAQs](#q-why-is-the-blob-properties-not-exposed-in-the-document-metadata) section).
671694
* Support `async_credential` parameter to allow using both sync and async token credentials with a single document loader instance
672695
(see [FAQs](#q-why-not-support-synchronous-token-credentials-when-calling-asynchronous-methods-and-vice-versa) section).
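
The `blob_names` and `prefix` rules introduced in this change can be sketched in isolation. The snippet below is a minimal illustration, not the package's implementation: `resolve_blob_selection` is a hypothetical helper name, and it models only the behavior stated in the proposal (a single string is treated as one blob name, an iterable as several, and passing both `blob_names` and `prefix` raises a `ValueError`).

```python
from typing import Iterable, List, Optional, Union


def resolve_blob_selection(
    blob_names: Optional[Union[str, Iterable[str]]] = None,
    prefix: Optional[str] = None,
) -> Optional[List[str]]:
    """Hypothetical helper: normalize `blob_names` per the proposal's rules.

    Returns a list of explicit blob names, or None when the loader should
    list blobs from the container (optionally filtered by `prefix`).
    """
    if blob_names is not None and prefix is not None:
        # The proposal states that these two parameters are incompatible.
        raise ValueError("Cannot specify both blob_names and prefix.")
    if blob_names is None:
        # No explicit names: fall back to listing blobs from the container.
        return None
    if isinstance(blob_names, str):
        # A bare string is one blob name, not an iterable of characters.
        return [blob_names]
    return list(blob_names)


print(resolve_blob_selection("report.txt"))
print(resolve_blob_selection(["blob1", "blob2"]))
print(resolve_blob_selection(prefix="some/prefix/"))
```

Checking for `str` before iterating matters because a string is itself iterable; without that branch, a single blob name would be split into one-character names.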
