Compute the MD5 hash of the file for Azure Storage #1187
Comments
As a side note, I'd also like to save the metadata (currently saved under a separate blob) in the Azure-provided "metadata" field. But I can open a separate issue for that.
Just had a look at implementing this:
Thanks for bringing this up. tusd currently doesn't have any feature for calculating or comparing checksums of the uploaded data, but I would like to change this in the future while adding support for the tus checksum extension and HTTP digest fields (for draft-ietf-httpbis-resumable-upload). I would love to collaborate with you on this if you are interested. Support for checksums / file integrity checks shouldn't be tied to Azure or any other storage. Instead, calculating the checksum while the data is uploaded is the responsibility of the central upload handling logic (in unrouted_handler.go). The calculated digests can then be used to verify the integrity of the entire upload or of individual PATCH requests, and can be passed on to the storage or hooks.
We would have to store the state of the checksum calculation whenever an upload is interrupted and its data saved. When the upload is resumed, we can continue the calculation from that state until the upload is finished.
Sure, I can spend a couple of hours looking into this over the weekend and see how far I get.
That sounds great! Before heading into an implementation, we can also brainstorm different implementation designs to make sure we cover all requirements.
I've moved to an alternative storage provider that doesn't need a hash, so I don't have the appetite to work on this ticket. Sorry!
No worries, thank you for the heads up!
Hello, hoping to open this thread up again. I am a maintainer for this project, which uses tusd as a Golang library. This project has the capability of copying files to secondary storage locations once upload completes, and I would like to use content hashes to verify the integrity of the file once it has been copied. Some high level questions I have are...
I'm glad to hear that you are interested! Let me answer these questions:
I haven't tested this, but in theory it should be possible to calculate a checksum while the server receives data. It only has to ensure that it saves the state of the checksum calculation between requests if an upload is interrupted or intentionally split across multiple requests. The hash algorithms in Go's standard library support this serialization: https://pkg.go.dev/hash#Hash
Checksums can be calculated by the client, tusd or the service behind the storage backend (e.g. AWS). Furthermore, each of these could also verify a checksum if they receive it. Every combination is possible and useful in specific scenarios. If you want to ensure that the file got correctly uploaded from the client to AWS, then it makes sense to compare the client's checksum with AWS' checksum. Tusd could relay the checksum from AWS to the client or vice versa depending on whether the checksum was available when the upload started or not. So there is no single correct answer. In some cases, we could reuse checksums from the provider, but we should also consider handling them in tusd at the same time.
The solution I envision would be integrated into tusd directly to avoid users having to deal with these specifics on their own. That should be the most useful and robust approach. That being said, these are just my thoughts so far. If you have something different in mind, please share your opinion and we can discuss it here.
Is your feature request related to a problem? Please describe.
I'm trying to use Ruby on Rails + Tusd. Rails' Active Storage requires a checksum to verify file integrity. Right now, Tusd doesn't compute the hash, so I have to disable the checksum verification on Rails, which is a cumbersome process.
Describe the solution you'd like
When saving blobs to Azure, save their MD5 hash too.
Describe alternatives you've considered
Can you provide help with implementing this feature?
Yes, happy to help!
Additional context
This was mentioned when Azure storage was added:
#401 (comment)
Copied from that thread:
Also, you can verify you don't have duplicates on your system.