Situation/Problem
Each time Docq is deployed to Streamlit Cloud it wipes all the data, because the storage is ephemeral. So this hosting option can only be used for a throwaway demo mode; it cannot be used for any real customer scenarios. That said, Streamlit hosting sits at the low-cost, easy end of the hosting options, and such an option has a place in a customer's journey to adopting Docq.
If we can persist data, we can use this option to host a real, usable version of Docq for customers, suitable for serious trials/pilots.
Requirements
Components with disk persistence that need to be altered:
SQLite - uses standard disk-based persistence, requiring a mount point
Datasource document list tracking - uses the Python standard library json module, which requires a disk mount point
LlamaIndex index - uses standard disk-based persistence, requiring a mount point
Manual file upload - st.file_uploader returns a byte array, which is written to disk using a standard file handler
Have the ability to configure the deployment to be S3-backed or filesystem-mount-backed for persistence (a sketch of such a switch follows).
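A minimal sketch of what that switch could look like, using fsspec's backend registry to select the filesystem at runtime. The DOCQ_PERSISTENCE variable and get_filesystem helper are hypothetical illustrations, not existing Docq code:

```python
import os

import fsspec


def get_filesystem() -> fsspec.AbstractFileSystem:
    """Return the configured persistence backend.

    DOCQ_PERSISTENCE is a hypothetical env var for illustration:
    "s3" selects the S3 backend (requires s3fs to be installed),
    anything else falls back to the local filesystem.
    """
    if os.environ.get("DOCQ_PERSISTENCE", "file") == "s3":
        return fsspec.filesystem("s3")
    return fsspec.filesystem("file")
```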
Solution
The high-level approach is to use an S3 bucket as the backing store. This is not a drop-in approach; it is proposed because there doesn't seem to be a drop-in option, i.e. Streamlit Cloud doesn't appear to support persistent filesystem mounts.
Each of the components we use that persists data will need to have some sort of support for S3 as a backing store. Below is the S3 backing solution for each component.
SQLite - https://github.com/uktrade/sqlite-s3vfs. Does the concurrency model change? s3vfs makes a point that it doesn't handle concurrent writes, so the app has to serialise them itself (see the SQLite sketch after this list).
LlamaIndex - StorageContext can take an fsspec filesystem instance for persistence via the fs argument (fsspec, s3fs; see the LlamaIndex sketch below).
Document list - does the json module support a file-like/stream interface? If so, use that together with an S3 interface module like s3fs (answered in the json sketch below).
Manual file uploads - switch to using fsspec / [s3fs](https://s3fs.readthedocs.io/en/latest/) rather than the standard file handler (see the upload sketch below).
fsspec supports several backing stores such as S3, local filesystem, GCS, etc.
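For SQLite, a minimal sketch based on the sqlite-s3vfs README (the bucket name is hypothetical). Note that sqlite-s3vfs is built on apsw rather than the stdlib sqlite3 module, so this is a code change, not a drop-in:

```python
import apsw
import boto3
from sqlite_s3vfs import S3VFS

# Register the S3-backed virtual filesystem against a bucket.
bucket = boto3.Session().resource("s3").Bucket("docq-data")  # hypothetical bucket
s3vfs = S3VFS(bucket=bucket)

# Open a database whose pages live in the bucket, via the custom VFS.
with apsw.Connection("docq.sqlite3", vfs=s3vfs.name) as db:
    cursor = db.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS spaces (id INTEGER PRIMARY KEY, name TEXT)")
```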
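For LlamaIndex, a sketch of fs-based persistence using the 0.x-era imports (bucket path hypothetical; embedding/LLM credentials assumed to be in the environment):

```python
import s3fs
from llama_index import Document, StorageContext, VectorStoreIndex, load_index_from_storage

s3 = s3fs.S3FileSystem()  # picks up AWS credentials from the environment

# Build a trivial index, then persist it to the bucket instead of local disk.
index = VectorStoreIndex.from_documents([Document(text="hello docq")])
index.storage_context.persist(persist_dir="docq-data/index", fs=s3)

# Later, load it back from S3 the same way.
storage_context = StorageContext.from_defaults(persist_dir="docq-data/index", fs=s3)
index = load_index_from_storage(storage_context)
```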
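On the json question: json.dump and json.load operate on any file-like object, so an s3fs file handle works directly. A sketch, with a hypothetical bucket/key:

```python
import json

import s3fs

fs = s3fs.S3FileSystem()

doc_list = [{"name": "handbook.pdf", "indexed": True}]

# Write the document-list tracking file straight to S3 ...
with fs.open("docq-data/documents.json", "w") as f:
    json.dump(doc_list, f)

# ... and read it back the same way.
with fs.open("docq-data/documents.json", "r") as f:
    doc_list = json.load(f)
```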
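For manual uploads, swapping the standard file handler for an s3fs one is a small change (bucket path hypothetical):

```python
import s3fs
import streamlit as st

fs = s3fs.S3FileSystem()

uploaded = st.file_uploader("Upload a document")
if uploaded is not None:
    # Write the uploaded bytes to the bucket instead of the local disk.
    with fs.open(f"docq-data/uploads/{uploaded.name}", "wb") as f:
        f.write(uploaded.getvalue())
```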
Alternatives
Simple persistent filesystem mount
This would be the simplest solution and therefore ideal, as it would require no code changes. However, there doesn't seem to be an option for this in Streamlit Cloud.
Streamlit file connections
This is unlikely to work given that none of the components we use to persist data support this interface out of the box.
@cwang: Looked into the whole Streamlit file persistence thing. I can't see an easy drop-in option. Basically, a persistent filesystem mount doesn't seem to be supported. The above is the best alternative I can think of.
SQLite is the biggest unknown and risk, given this note from s3vfs:
Python virtual filesystem for SQLite to read from and write to S3.
No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.
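If we go ahead anyway, the minimum mitigation would be serialising writes inside the app. A sketch, assuming a single app process (a second replica writing to the same bucket would still be unsafe; the helper name is hypothetical):

```python
import threading
from contextlib import contextmanager

# One process-wide lock: Streamlit serves each session on its own
# thread, so this serialises writes within a single app instance.
_db_write_lock = threading.Lock()


@contextmanager
def exclusive_write(connection):
    """Hold the lock for the duration of a write transaction."""
    with _db_write_lock:
        yield connection.cursor()
```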
The Streamlit data connections feature abstracts over s3fs and hence fsspec, i.e. the stack is S3, the Streamlit file connection, and s3fs (a sketch is below).
See KB article
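For reference, this is roughly what the file connection interface looks like (package and API as per the Streamlit tutorial; exact names depend on the Streamlit version, and the bucket path is hypothetical). It only helps code we write ourselves; SQLite and LlamaIndex won't speak this interface out of the box:

```python
import streamlit as st
from st_files_connection import FilesConnection

# st.connection was st.experimental_connection on older Streamlit releases.
conn = st.connection("s3", type=FilesConnection)

# Read a file from the bucket through the fsspec/s3fs abstraction.
text = conn.read("docq-data/notes.txt", input_format="text", ttl=600)
st.write(text)
```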