DVC vs S3 #5

Open
kmexter opened this issue Nov 6, 2024 · 0 comments

kmexter commented Nov 6, 2024

Copied from a comment from @cymon in another issue

There are a bunch more examples of ro-crates in ./emo-bon-ro-crate-repository; these are all the sediment analyses for batches 1 and 2.

I created a "ghost" archive for each run, where all the files are present but their contents are missing (except for the word "ghost") - this means all the links in the ro-crates' metadata.json are functional. Of course, once we have a public S3 repo (and not just the loaner from d4science that we are currently using), we can upload the real data - for now this effectively just prototypes the procedures.
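For anyone wanting to reproduce the ghost archives, here is a minimal sketch of the idea, not the code actually used in the repository. The function name `make_ghost_archive` and its parameters are hypothetical; it assumes the real run lives in a local directory and that the ghost copy is written to a separate directory outside it.

```python
from pathlib import Path


def make_ghost_archive(real_root: Path, ghost_root: Path, marker: str = "ghost") -> int:
    """Mirror the tree under real_root into ghost_root (hypothetical helper).

    Every file keeps its relative path and name, but its contents are
    replaced by `marker`. Returns the number of ghost files written.
    Assumes ghost_root is not located inside real_root.
    """
    count = 0
    for src in sorted(real_root.rglob("*")):
        dest = ghost_root / src.relative_to(real_root)
        if src.is_dir():
            # Keep empty directories so the layout matches the real archive.
            dest.mkdir(parents=True, exist_ok=True)
        elif src.is_file():
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(marker)
            count += 1
    return count
```

Because the relative paths are preserved, any link in the metadata that points at a file in the real archive also resolves in the ghost one.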

DVC - I coded up the functions to add DVC stubs to the archives, but to my mind there are several issues with using DVC:

  • The data that results from a MGF doesn't change (does not get updated) - so the main point of DVC, keeping track of changing data sets, is irrelevant
  • adding DVC stubs to files and uploading to S3 means there are no typical download URLs to use as links in the metadata.json stanzas
  • to download the data from S3 using the dvc file, you need to install dvc and pull. This is doable, but it will turn people off from using the data if they have to go through a dvc configuration. We could provide the dvc config in each ro-crate, but again more faff...
  • the original idea was only to have "large" files in S3 and the "smaller" files on github, but dvc can only be initialised with git if the dvc root folder is also the git root folder. This isn't going to work. We could make a separate repository for emo-bon-ro-crates and init both git and dvc in the root directory, but at this point it's getting really messy and difficult to explain to someone who just wants to get the data

Basically, the alternative - i.e. not using dvc - just makes life so much easier:

  • we upload all of the data files (big and small) to the S3 store using s5cmd and use the public URLs as the download URLs in the stanza. This is what I've currently implemented, because it is just so much simpler.
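To make the "public URL as download URL" idea concrete, here is a rough sketch of how a stanza could be built once a file has been uploaded. Everything specific in it is an assumption: the endpoint and bucket are placeholders, the URL is assumed to be path-style (`endpoint/bucket/key`), and the stanza keys follow the general RO-Crate pattern of a `File` entity whose `@id` is the absolute URL, which may not match the stanzas actually generated in the repository.

```python
from urllib.parse import quote


def public_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style public URL for an object (assumed URL layout)."""
    # quote() keeps "/" by default, so the key's folder structure is preserved.
    return f"{endpoint.rstrip('/')}/{bucket}/{quote(key)}"


def file_stanza(endpoint: str, bucket: str, key: str, size_bytes: int) -> dict:
    """Return a hypothetical metadata stanza for one uploaded file."""
    return {
        "@id": public_url(endpoint, bucket, key),  # download URL doubles as the id
        "@type": "File",
        "name": key.rsplit("/", 1)[-1],
        "contentSize": str(size_bytes),
    }


# Placeholder endpoint, bucket and key, for illustration only.
stanza = file_stanza("https://s3.example.org", "example-bucket", "run-1/results/a.tsv", 1024)
```

The point of the design is that the link in the stanza is an ordinary HTTPS URL, so anyone can fetch the file with a browser, curl or wget without installing or configuring anything.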

3 participants