Add S3 integration for input/output files #597

Open
willkara opened this issue Dec 13, 2024 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@willkara

Requested feature

A lot of groups make heavy use of S3-compatible endpoints, such as S3 itself and LakeFS, to store their documents and artifacts. I was hoping the team could look at implementing S3 as a possible storage integration.

The idea is that teams could pass an S3-compatible bucket as a source or destination, and Docling would automatically pick up the user's credentials and configuration to read inputs from the bucket and write results back to it.
...

Alternatives

None
...

@willkara willkara added the enhancement New feature or request label Dec 13, 2024
@dolfim-ibm dolfim-ibm added the help wanted Extra attention is needed label Dec 16, 2024
@dolfim-ibm
Contributor

You might be interested in trying out Data-Prep-Kit, which orchestrates distributed batch conversion using Docling and already supports S3 storage: https://ds4sd.github.io/docling/integrations/data_prep_kit/

Connectors built directly into Docling might also be supported in the future, but with Data-Prep-Kit you can get something working already.

@willkara
Author

Oh snap, that's awesome, I'll definitely check it out. For now I'm running test cases and 5000+ file conversions manually in DevSpaces so I can prove it out for wider usage.

Data-Prep-Kit seems to be more of an orchestrator that runs these kinds of workloads on top of the individual tools. Given that, I wonder whether Docling should focus on storage-agnostic integrations rather than direct S3 support. Thanks!


2 participants