Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bind DVC data registry #880

Merged
merged 14 commits into from
Mar 6, 2024
Merged

Conversation

AlexVCaron
Copy link
Contributor

@AlexVCaron AlexVCaron commented Jan 17, 2024

Quick description

Everything required in Scilpy to throw away gdrive and embrace DVC !

From discussions :

  • Scil_data : contains all datasets we need to test, validate, try code, and more (in ./data). Contains packages of datasets easy to download and version in one line (in ./store).
  • Scilpy : has a DVC handle pointing to Scil_data. Loads files and packages by interfacing through DVC.
  • Other repos : use the same gammick as Scilpy to load datasets they require.

Looking at everything (go through scil_data, it seems like a lot of duplication, disk space must be wasted !!!!! Not at all.

Everything contained in the repository is index files for DVC, kilobyte sized, super light for Git. The data per say is all contained here. You won't be able to recognize anything, that's the point. Every file indexed by DVC gets hashed and the md5 is used then to create the filepath.

Doing so, we can allow duplication in the Git repository - since md5 always match, even when the names and timestamps differ - and there will never be duplication on the data server.

Using DVC is very similar to how we used the fetcher, with the addition that we now need to keep track of test packages revisions. To create a new test data package :

  1. Clone scil_data and follow the documentation to create a new package.
  2. Once uploaded, copy the commit at which the package was published and add it to data/test_descriptors.yml under your test case name.
  3. Use the pull_test_case_package function from scilpy.io.dvc in your test script to make the data available when testing.

...

Type of change

Check the relevant options.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Provide data, screenshots, command line to test (if relevant)

The aodf_metrics test case is implemented as POC :

pytest scripts/tests/test_aodf_metrics.py

Checklist

  • My code follows the style guidelines of this project (run autopep8)
  • I added relevant citations to scripts, modules and functions docstrings and descriptions
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I moved all functions from the script file (except the argparser and main) to scilpy modules
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@AlexVCaron AlexVCaron changed the title Feat/dvc_registry Bind DVC data registry Jan 17, 2024
@arnaudbore arnaudbore self-requested a review January 17, 2024 21:36
@arnaudbore
Copy link
Contributor

@arnaudbore
Copy link
Contributor

@arnaudbore
Copy link
Contributor

@arnaudbore
Copy link
Contributor

Build passed ! Good Job 🍻 !

Copy link

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 87.87879% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 69.49%. Comparing base (4491228) to head (9f34e4d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #880      +/-   ##
==========================================
+ Coverage   69.46%   69.49%   +0.03%     
==========================================
  Files         390      392       +2     
  Lines       20944    20966      +22     
  Branches     3205     3207       +2     
==========================================
+ Hits        14548    14570      +22     
  Misses       5093     5093              
  Partials     1303     1303              
Components Coverage Δ
Scripts 71.93% <ø> (+0.01%) ⬆️
Library 65.48% <87.87%> (+0.07%) ⬆️

Copy link
Contributor

@arnaudbore arnaudbore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments, I would like @EmmaRenauld or/and @karanphil to give their feedback about the init.py first before moving forward.

.dvc/config Outdated Show resolved Hide resolved
scilpy/io/dvc.py Outdated Show resolved Hide resolved
scilpy/__init__.py Show resolved Hide resolved
@AlexVCaron
Copy link
Contributor Author

I found a bug in DVC (in a sub dependency dvc-data). It makes the cache slightly useless for now, but doesn't undermine this PR. Still we can follow iterative/dvc-data#512

Copy link
Contributor

@arnaudbore arnaudbore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GTG

@arnaudbore arnaudbore merged commit 73d1aa5 into scilus:master Mar 6, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants