DVC vs S3 #5

kmexter · 2024-11-06T07:27:10Z

Copied from a comment from @cymon in another issue

There are a bunch more examples of ro-crates in ./emo-bon-ro-crate-repository; these are all the sediment analyses for batches 1 and 2.

I created a "ghost" archive for each run, where all the files are present but the file contents are missing (except for the word "ghost") - this means all the links in the ro-crates' metadata.json are functional. Of course, once we have a public S3 repo (and not just the loaner from d4science that we are currently using), we can upload the real data - for now this effectively just prototypes the procedures.
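For anyone wanting to reproduce the "ghost" idea, here is a minimal sketch of how such an archive could be built. It is not the code actually used for the emo-bon crates; the function name `make_ghost_archive` and the directory names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path

GHOST_TEXT = "ghost"  # placeholder content, as described in the comment above


def make_ghost_archive(source_dir: Path, ghost_dir: Path) -> list[Path]:
    """Mirror source_dir into ghost_dir, writing GHOST_TEXT instead of real content.

    Hypothetical helper: the directory layout is preserved so that relative
    links in a metadata file still resolve, but no real data is copied.
    """
    created = []
    for src in sorted(source_dir.rglob("*")):
        target = ghost_dir / src.relative_to(source_dir)
        if src.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(GHOST_TEXT)
            created.append(target)
    return created


if __name__ == "__main__":
    # Small self-contained demo in a temporary directory (made-up file names).
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp) / "run1"
        (run / "results").mkdir(parents=True)
        (run / "results" / "table.tsv").write_text("real data")
        ghosts = make_ghost_archive(run, Path(tmp) / "run1-ghost")
        print([p.name for p in ghosts])
```

The source tree is only read, never modified, so the same function can be pointed at the real run directories later without risk.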

DVC - I coded up the functions to add DVC stubs to the archives, but to my mind there are several issues with using DVC:

The data that results from a MGF doesn't change (it does not get updated), so the main point of DVC, keeping track of changing data sets, is irrelevant.
Adding DVC stubs to files and uploading to S3 means there are no typical download URLs to use as links in the metadata.json stanzas.
To download the data from S3 using the dvc file, you need to install dvc and pull. So this is doable, but it will turn people off from using the data if they have to go through a dvc configuration. We could provide the dvc config in each ro-crate, but again more faff...
The original idea was to have only the "large" files in S3 and the "smaller" files on GitHub, but dvc can only be initialised with git if the dvc root folder is also the git root folder. This isn't going to work. We could make a separate repository for emo-bon-ro-crates and init both git and dvc in the root directory, but at this point it's getting really messy and difficult to explain to someone who just wants to get the data.

Basically, the alternative - i.e. not using dvc - just makes life so much easier:

We upload all of the data files (big and small) to the S3 store using s5cmd and use the public URLs as the download URLs in the stanza. This is what I've currently implemented, because it is just so much simpler.
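To make the "public URL in the stanza" idea concrete, here is a rough sketch of how a file entry could be generated. It is an illustration only, not the implemented code: the endpoint, bucket and key are made up, the path-style URL layout (endpoint/bucket/key) is an assumption about how the store exposes public objects, and the `downloadUrl` property name is a guess at what the stanza uses.

```python
from urllib.parse import quote


def public_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style public URL (assumed layout: endpoint/bucket/key)."""
    # quote() percent-encodes spaces etc. but keeps "/" so the key path survives.
    return f"{endpoint.rstrip('/')}/{bucket}/{quote(key)}"


def file_stanza(endpoint: str, bucket: str, key: str) -> dict:
    """Return a metadata entry for one uploaded file.

    Hypothetical shape: "downloadUrl" is an assumed property name, not
    confirmed against the actual metadata.json used in the crates.
    """
    return {
        "@id": key,
        "@type": "File",
        "downloadUrl": public_url(endpoint, bucket, key),
    }


if __name__ == "__main__":
    # Made-up endpoint, bucket and object key, for illustration only.
    print(file_stanza("https://s3.example.org", "demo-bucket", "run1/results/table.tsv"))
```

With this approach anyone can fetch a file with a plain HTTP GET on the URL in the stanza, without installing or configuring any extra tooling.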

kmexter assigned cymon and marc-portier Nov 6, 2024
