Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Externalized attachments #1021

Open
fflorent opened this issue Jun 6, 2024 · 7 comments
Open

Externalized attachments #1021

fflorent opened this issue Jun 6, 2024 · 7 comments
Labels
Attachments enhancement New feature or request

Comments

@fflorent
Copy link
Collaborator

fflorent commented Jun 6, 2024

The idea has been proposed by @paulfitz #886 (comment)

If I understand correctly, and I tend to think that's also something we would like on the ANCT's side: the idea would be to allow creating an attachment column where the file would not be injected in the document but rather in an external storage like S3.

There are several advantages, especially reducing the grist document size and thus allowing increasing the attachment size limit.

@fflorent fflorent added enhancement New feature or request Attachments labels Jun 6, 2024
@paulfitz
Copy link
Member

paulfitz commented Jun 6, 2024

Thanks for opening this @fflorent ! Going to dump in some thoughts a developer prepared about this in 2022. Grist has changed somewhat since, and also this was not a plan just one person's thoughts (I personally disagree with some of it), but the set of concerns raised could be helpful for things about this project.

Externalizing attachments

External store

We need a generic interface for storing and retrieving file data that can be implemented in different ways. One obvious data store is S3-compatible stores. Theoretically, the local filesystem might also work.

Migration

Once at least one store is implemented, _gristsys_Files could be deprecated, and a special migration could move the data from there to the external store. The usual Python migration system won’t be enough on its own because the data engine doesn’t see _gristsys_Files, but maybe it could make an external call to node to deal with that.

Downloading documents

We still want to be able to download a single self-contained .grist sqlite file containing all the attachments. When this happens, we’d need to:

  1. Make a copy of the database file
  2. Download all the externalized attachments and put them in the copy, perhaps back in _gristsys_Files
  3. Give that to the user to download
  4. When the document is uploaded again, perform the same process as the migration to move data to the external store.

This would also allow using downloaded documents in older versions of Grist.

Serving attachments without the DocWorker

Currently the client uses a special DocWorker API to view and download attachments. To serve the files, the DocWorker retrieves them from _gristsys_Files. In the first iteration of work, this would be changed to retrieving them from the external store instead. But in the long term, it would be nice if the client could bypass the DocWorker and retrieve the files directly from the store. S3 would work well for this, but other types of store may not allow this.

Deleting externalized attachments

Attachments are likely to contain sensitive data, and storing them longer than necessary is a security risk. When a user deletes an attachment, it’s reasonable for them to expect it to actually be deleted eventually, just like any other data, so that it can’t be leaked. This applies whether they deleted a row, a document, or an entire organisation. We can’t actually fully delete the data immediately in the first case because deleted rows still live in the snapshot history, but we should delete them eventually.

This is like the problem of tracking attachments referenced within a document, on a much larger scale. In this case actually tracking the references (or maybe just their counts) from documents to the external store seems essential. These would need to be updated whenever a document is copied or deleted within a Grist installation. We’d need to consider:

  • “Duplicate Document”
  • “Work on a copy”
  • Other ways of ‘forking’ such as from fiddle mode or templates
  • Creation and pruning of snapshots.
  • Deleting a document permanently.

Downloading a document ‘disconnects’ it from the Grist installation so it doesn’t need to be counted. It has its own copy of the attachments so it should either delete or ignore the metadata about the externalized data.

Encryption

An alternative to tracking attachment references to allow deleting them is to encrypt the attachment data to avoid the need to delete it. Each attachment file would have a unique encryption key stored only in the corresponding row of _grist_Attachments . Once all copies of that row are fully deleted, the encryption key should be lost, and decrypting the data in the external store should become impossible. That means we don’t ever have to delete the actual data, so we don’t need to keep track of references to it.

Another security benefit of encryption is that if someone gains access to the data in the external attachments store, they can’t actually read it unless they also have the referencing documents.

One downside is that serving attachments directly from S3 instead of the DocWorker becomes more tricky. Decrypting and displaying a single encrypted file in the browser using SubtleCrypto and createObjectURL seems straightforward. But it’s a lot more delicate to handle a user scrolling through a grid filled with thumbnails, displaying them all efficiently and then reclaiming memory after they disappear from view.

Access Control

Would need thinking about. Important to preserve the property that the existing metadata (particularly fileIdent) is not enough to download the file, so that access is properly revoked even if someone has a past copy of the metadata. It might also be nice if the download URL couldn’t be computed purely from the file content, so that someone with a local copy of a file can’t test whether it exists in the document.

@Sieboldianus
Copy link

Sieboldianus commented Sep 10, 2024

I added this to the other issue already. If you are looking for inspiration, HackMD has a pretty nice integration of Drag&Drop Image upload, which just puts files in a folder on the server using a random hash. UX-wise it is a pretty nice seamless integration and works flawlessly.

@paulfitz
Copy link
Member

paulfitz commented Oct 9, 2024

Just to note that architecture work for this feature has started, with some initial thinking from @Spoffy here:
https://docs.google.com/document/d/1ST_DuR22llDyx4PVAMdBlHDz22L5QIJ4qIyLMOUOtZ8/edit#heading=h.guzwiuwiigkd
If anyone would like edit rights, let me know.

There are trade-offs to be made since Grist will no longer be a standalone single-file format. We're trying to come up with a design that still makes moving Grist docs between installations practical.

@paulfitz paulfitz pinned this issue Oct 10, 2024
@paulfitz
Copy link
Member

@Spoffy I updated the proposed change to _grist_Files as follows:

  • Removed a separate new identifier column. How about we work with the existing ident column, improving it as needed. It is a little funky but that's ok, and better than having two very similar columns.
  • I converted the boolean column to an enum with just a single non-null value. It is equivalent and every boolean I've let into a schema I've eventually regretted.

If you combine these two changes, you end up with a very harmless low-commitment schema change that seems safe to do even without all details worked out?

@Spoffy
Copy link
Contributor

Spoffy commented Oct 28, 2024

Paul and I have an initial design for this now, found here.

The latest design / implementation is in the doc, so I won't copy and paste it here.

Next steps are prototyping this approach and run some basic tests to ensure we haven't missed anything that would cause a big problem.

@Spoffy
Copy link
Contributor

Spoffy commented Nov 13, 2024

Quick update on this. The prototype is underway now, and we've got basic attachment functionality working with MinIO and Filesystem storage.

@Spoffy
Copy link
Contributor

Spoffy commented Nov 25, 2024

The first PR for this is open, albeit not yet ready for full review and merging: #1320

It covers the necessary core document changes to make external attachments function, with just about enough configuration to let it be testable.

The full scope of that PR is in the PR description 🙂

@fflorent @vviers @hexaltation - if any of you want to have a first look, now is a good time 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Attachments enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants