
RFC: Streamlit hosting - data persistence support #63

Open
janaka opened this issue Jul 14, 2023 · 3 comments
janaka commented Jul 14, 2023

Situation/Problem

Each time Docq is deployed to Streamlit Cloud it wipes all the data, because the platform only provides ephemeral storage. So this hosting option can only be used for a throwaway demo mode; it cannot be used for any real customer scenarios. Streamlit hosting is on the low-cost, easy end of the hosting spectrum, and such an option has a place in a customer's journey to adopting Docq.

If we can persist data, we can host a real, usable version for customers, suitable for serious trials/pilots.

Requirements

Components with disk persistence that need to be altered:

  • SQLite - uses standard disk-based persistence, requiring a mount point

  • Datasource document list tracking - uses the Python standard library json module, which requires a disk mount point

  • LlamaIndex index - uses standard disk-based persistence, requiring a mount point

  • Manual file upload - st.file_uploader returns a byte array, which is written to disk using a standard file handler

Have the ability to configure the deployment to be S3-backed or filesystem-mount-backed for persistence.
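As a sketch, this requirement could be a single environment-variable switch shared by all components. The variable names below are hypothetical, not existing Docq configuration:

```python
import os

# Hypothetical toggle between S3-backed and filesystem-backed persistence.
# DOCQ_PERSISTENCE_BACKEND, DOCQ_S3_BUCKET, and DOCQ_DATA_DIR are assumed
# names for illustration only.
backend = os.environ.get("DOCQ_PERSISTENCE_BACKEND", "filesystem")

if backend == "s3":
    # All components would write under a common S3 prefix.
    root_url = f"s3://{os.environ['DOCQ_S3_BUCKET']}/docq-data"
else:
    # Default: plain local directory, as today.
    root_url = os.environ.get("DOCQ_DATA_DIR", "./.persisted/docq-data")

print(root_url)
```

Each component would then derive its own path/URL from `root_url`, so switching backends is one config change rather than four.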

Solution

The high-level approach is to use an S3 bucket as the backing store. This is not a drop-in approach; it is proposed because there doesn't seem to be a drop-in solution: Streamlit Cloud doesn't appear to support persistent filesystem mounts.

Each of the components we use that persists data will need some form of S3 support as a backing store. Below is the S3-backed solution for each component.

  • SQLite - https://github.com/uktrade/sqlite-s3vfs. Does the concurrency model change? s3vfs makes a point that it doesn't handle concurrent writes, so this needs to be handled in the app.

  • LlamaIndex - StorageContext can take an fsspec filesystem instance for persistence via the fs argument (fsspec, s3fs)

  • document list - does the json module support a byte array/stream interface? If so, use that together with an S3 interface module like s3fs.

  • manual file uploads - switch to using fsspec / [s3fs](https://s3fs.readthedocs.io/en/latest/) rather than a standard file handler.
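On the document-list question above: yes, the standard-library json module writes to any text file-like object via json.dump, so the destination can just as easily be a stream opened against S3 (e.g. via s3fs) as a local file. A minimal stdlib-only sketch:

```python
import io
import json

# json.dump accepts any object with a .write() method, so the target can be a
# local file, an in-memory buffer, or an s3fs/fsspec file handle.
doc_list = {"tracked_docs": ["report.pdf", "notes.txt"]}

buf = io.StringIO()  # stand-in for e.g. fs.open("s3://bucket/docs.json", "w")
json.dump(doc_list, buf)

restored = json.loads(buf.getvalue())
print(restored["tracked_docs"])  # → ['report.pdf', 'notes.txt']
```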

fsspec supports several backing stores, such as S3, local file, GCS, etc.
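To illustrate the fsspec abstraction for the manual-upload case: the same open/write code works across backends, and only the URL scheme changes. This sketch uses fsspec's built-in in-memory backend so it runs without credentials; the bucket and file names are made up:

```python
import fsspec

# "memory://" is fsspec's built-in in-memory filesystem; in production this
# would be something like "s3://some-bucket/uploads/report.pdf" (hypothetical
# bucket name), with s3fs installed to provide the s3:// backend.
uploaded_bytes = b"%PDF-1.4 fake upload"  # stand-in for st.file_uploader(...).getvalue()
target = "memory://uploads/report.pdf"

with fsspec.open(target, "wb") as f:
    f.write(uploaded_bytes)

with fsspec.open(target, "rb") as f:
    print(f.read() == uploaded_bytes)  # → True
```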

Alternatives

Simple persistent filesystem mount

This would be the simplest solution and therefore ideal: it would require no code changes. However, there doesn't seem to be an option for this in Streamlit Cloud.

Streamlit file connections

This is unlikely to work, given that none of the components we use that persist data support this interface out of the box.

Streamlit's data connections feature abstracts over s3fs, and hence fsspec (specifically: S3, Streamlit file connections, and s3fs).

See KB article

@janaka janaka self-assigned this Jul 14, 2023
@janaka janaka changed the title INFRA: Streamlit hosting - data persistence support RFC: Streamlit hosting - data persistence support Jul 14, 2023
janaka commented Jul 14, 2023

@cwang I looked into the whole Streamlit file persistence thing. I can't see an easy drop-in option. Basically, a persistent filesystem mount doesn't seem to be supported. Above is the best alternative I can think of.

SQLite is the biggest unknown and risk given the s3vfs comment:

Python virtual filesystem for SQLite to read from and write to S3.

No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.
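Given that warning, one mitigation is to serialise all database writes inside the app itself. A minimal sketch, assuming a single app process (multi-process deployments would need external coordination; the helper name is made up):

```python
import threading

# s3vfs performs no locking, so the application must guarantee that writes
# never overlap. Simplest approach: a single process-wide write lock.
_db_write_lock = threading.Lock()

def serialised_write(write_fn, *args, **kwargs):
    """Run a database write while holding the global write lock."""
    with _db_write_lock:
        return write_fn(*args, **kwargs)

# Usage sketch: any function that issues INSERT/UPDATE goes through the wrapper.
results = []
serialised_write(results.append, "row-1")
print(results)  # → ['row-1']
```

This serialises writes but not reads; per the s3vfs note, overlapping reads and writes would also need coordination, which this sketch does not cover.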

cwang commented Sep 8, 2023

Should be able to run it on render.com, utilising the mounted shared disk.

janaka commented Sep 10, 2023

Yes. But it would be free on Azure.
