Copied from a comment from @cymon in another issue
There are a bunch more examples of RO-Crates in ./emo-bon-ro-crate-repository; these are all the sediment analyses for batches 1 and 2.
I created a "ghost" archive for each run where all the files are present but the file contents are missing (except for the word "ghost") - this means all the links in the ro-crates metadata.json are functional. Of course, once we have a public S3 repo (and not just the loner from d4science that we are currently using), we can upload the real data - but this just prototype procedures effectively.
DVC - I coded up the functions to add DVC stubs to the archives, but to my mind there are several issues with using DVC:
- The data that results from an MGF run doesn't change (it does not get updated), so the main point of DVC, tracking changing data sets, is irrelevant here.
- Adding DVC stubs to files and uploading to S3 means there are no ordinary download URLs to use as links in the metadata.json stanzas.
- To download the data from S3 via the .dvc stub, you need to install DVC and run a pull (see the sketch after this list). This is doable, but it will put people off using the data if they have to go through a DVC configuration first. We could provide the DVC config in each RO-Crate, but again, more faff...
- The original idea was to have only the "large" files in S3 and the "smaller" files on GitHub, but DVC can only be initialised with git if the DVC root folder is also the git root folder. This isn't going to work. We could make a separate repository for emo-bon-ro-crates and init both git and DVC in the root directory, but at that point it's getting really messy and difficult to explain to someone who just wants to get the data.
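To illustrate the friction: this is roughly what a data consumer would have to run just to fetch one archive via the DVC route (the commands are standard DVC CLI; the remote name, bucket, and stub filename are illustrative, and this assumes a git checkout that has already been `dvc init`-ed):

```python
import subprocess

# install DVC with S3 support
subprocess.run(["pip", "install", "dvc[s3]"], check=True)
# point DVC at the bucket (hypothetical bucket name)
subprocess.run(["dvc", "remote", "add", "-d", "storage", "s3://emo-bon-data"], check=True)
# finally fetch the real file behind the stub
subprocess.run(["dvc", "pull", "run-archive.tar.gz.dvc"], check=True)
```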
Basically, the alternative, i.e. not using DVC, just makes life so much easier:
we upload all of the data files (big and small) to the S3 store using s5cmd and use the public URLs as the download URLs in the stanzas. This is what I've currently implemented, because it is just so much simpler.
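A minimal sketch of that flow, assuming a publicly readable bucket (the bucket name and endpoint here are hypothetical placeholders, not the real d4science values):

```python
import subprocess

BUCKET = "emo-bon-data"                    # hypothetical bucket name
ENDPOINT = "https://s3.example-host.org"   # hypothetical public S3 endpoint

def upload_and_link(local_path: str, key: str) -> str:
    """Upload one data file with s5cmd and return the plain public URL
    to drop into the corresponding metadata.json stanza."""
    subprocess.run(
        ["s5cmd", "--endpoint-url", ENDPOINT,
         "cp", local_path, f"s3://{BUCKET}/{key}"],
        check=True,
    )
    return f"{ENDPOINT}/{BUCKET}/{key}"

# hypothetical usage: the returned URL goes straight into the stanza
# upload_and_link("results/taxonomy.tsv", "batch1/run01/taxonomy.tsv")
# -> "https://s3.example-host.org/emo-bon-data/batch1/run01/taxonomy.tsv"
```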