
Compute the MD5 hash of the file for Azure Storage #1187

Open
penguoir opened this issue Sep 16, 2024 · 9 comments
@penguoir

Is your feature request related to a problem? Please describe.
I'm trying to use Ruby on Rails with tusd. Rails' Active Storage requires a checksum to verify file integrity. Right now, tusd doesn't compute one, so I have to disable checksum verification in Rails, which is cumbersome.

Describe the solution you'd like
When saving blobs to Azure, save their MD5 hash too.

Describe alternatives you've considered

  • Generating the hash in Rails after uploading the file to Azure
  • Disabling integrity verification

Can you provide help with implementing this feature?
Yes, happy to help!

Additional context

This was mentioned when Azure storage was added:

#401 (comment)

Copied from that thread:

may I ask what benefit computing the MD5 hash of the file has? (I've never used it, so I am curious) Would you compute the MD5 hash of the block of the blob, or the entire file?

The hash is used to verify the integrity of the blob/file during transport.

Also, you can verify you don't have duplicates on your system.

@penguoir
Author

As a side note, I'd also like to save the metadata (currently saved under a separate blob) in the Azure-provided "metadata" field. But I can open a separate issue for that.

@penguoir
Author

Just had a look at implementing this:

  • I don't think it's right to add the hash to the info blob. We write the info blob before reading the actual file, so there's no way to get the file's hash into the info blob (unless we update the info blob after the file is uploaded).
  • I'm not sure whether it's possible to correctly compute the hash of the uploaded file. It depends on:
    • Does UploadChunk always run in start-to-finish order, and sequentially rather than in parallel? We need both for the hash computation to work.
  • How do we handle resumable uploads?

@Acconut
Member

Acconut commented Sep 17, 2024

Thanks for bringing this up. tusd currently doesn't have any feature for calculating or comparing checksums of the uploaded data, but I would like to change this in the future while adding support for the tus checksum extension and HTTP digest fields (for draft-ietf-httpbis-resumable-upload). I'd love to collaborate with you on this if you are interested.

Support for checksums / file integrity checks shouldn't be tied to Azure or any other storage. Instead, calculating the checksum while the data is uploaded is the responsibility of the central upload handling logic (in unrouted_handler.go). The calculated digests can then be used to verify the integrity of the entire upload or individual PATCH requests and be provided to the storage or hooks.

  • How do we handle resumable uploads?

We would have to store the state of the checksum calculation if the upload is interrupted/saved. If the upload is resumed, we can continue the calculation until the upload is finished.

@penguoir
Author

Sure, I can spend a couple of hours looking into this over the weekend and see how far I get.

@Acconut
Member

Acconut commented Sep 17, 2024

That sounds great! Before heading into an implementation, we can also brainstorm different implementation designs to make sure we cover all requirements.

@penguoir
Author

I've moved to an alternative storage provider that doesn't need a hash, so I no longer have the appetite to work on this ticket. Sorry!

@Acconut
Member

Acconut commented Sep 30, 2024

No worries, thank you for the heads up!

@cfarmer-fearless

Hello, hoping to open this thread up again. I am a maintainer for this project, which uses tusd as a Go library. This project can copy files to secondary storage locations once an upload completes, and I would like to use content hashes to verify the integrity of the file once it has been copied.

Some high-level questions I have are:

  • How should the hash of the file be generated? You mention it can be done as chunks come across, but how does that work?
  • Seems like Azure and AWS have the capability to generate checksums asynchronously. Should we rely on that?
  • What specific HTTP headers need to be manipulated?
  • Can this be done in hooks or does it need to be done deeper within tusd?
  • What if the client wants to provide their own hash?

@Acconut
Member

Acconut commented Nov 26, 2024

I'm glad to hear that you are interested! Let me answer these questions:

  • How should the hash of the file be generated? You mention it can be done as chunks come across, but how does that work?

I haven't tested this, but in theory it should be possible to calculate a checksum while the server receives data. It only has to ensure that it saves the state of the checksum calculation between requests if an upload is interrupted or intentionally split across multiple requests.

The hash algorithms in Go's standard library support this serialization: https://pkg.go.dev/hash#Hash

Hash implementations in the standard library (e.g. hash/crc32 and crypto/sha256) implement the encoding.BinaryMarshaler and encoding.BinaryUnmarshaler interfaces. Marshaling a hash implementation allows its internal state to be saved and used for additional processing later, without having to re-write the data previously written to the hash. The hash state may contain portions of the input in its original form, which users are expected to handle for any possible security implications.

Compatibility: Any future changes to hash or crypto packages will endeavor to maintain compatibility with state encoded using previous versions. That is, any released versions of the packages should be able to decode data written with any previously released version, subject to issues such as security fixes. See the Go compatibility document for background: https://golang.org/doc/go1compat

  • Seems like Azure and AWS have the capability to generate checksums asynchronously. Should we rely on that?

Checksums can be calculated by the client, tusd or the service behind the storage backend (e.g. AWS). Furthermore, each of these could also verify a checksum if they receive it. Every combination is possible and useful in specific scenarios. If you want to ensure that the file got correctly uploaded from the client to AWS, then it makes sense to compare the client's checksum with AWS' checksum. Tusd could relay the checksum from AWS to the client or vice versa depending on whether the checksum was available when the upload started or not.

So there is no single correct answer. In some cases, we could reuse checksums from the provider, but we should also consider handling them in tusd at the same time.

  • What specific HTTP headers need to be manipulated?
  • What if the client wants to provide their own hash?

The tus protocol defines the Upload-Checksum header for the client to send checksums to the tus server: https://tus.io/protocols/resumable-upload#checksum

  • Can this be done in hooks or does it need to be done deeper within tusd?

The solution I envision would be integrated into tusd directly to avoid users having to deal with these specifics on their own. That should be the most useful and robust approach.

That being said, these are just my thoughts so far. If you have something different in mind, please share your opinion and we can discuss it here.
