Manage downstream datasets with locationless source dataset #178
base: develop
Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop     #178     +/-  ##
===========================================
+ Coverage     66.8%    68.42%    +1.61%
===========================================
  Files           42        42
  Lines         3190      3230       +40
===========================================
+ Hits          2131      2210       +79
+ Misses        1059      1020       -39
Continue to review full report at Codecov.
Hey Santosh, thanks for getting this working.
There are a few things here that it'd be good to clean up before merging. Some are really easy fixes, but some of the logic is also getting hard to reason about, and we should sit down and work through it together to see if we can simplify it.

- `global` is only required when assigning to a global variable
- The product list only needs to be computed once
- Change argument ordering to be consistent between functions
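To illustrate the first two points, here is a minimal sketch; the function, variable, and index names are hypothetical and not taken from this PR's code:

```python
# Hypothetical sketch of the review points above; none of these names
# come from the PR itself.
_product_list = None  # module-level cache, computed once


def load_product_list(fetch_products):
    # `global` is only required because this function *assigns* to the
    # module-level name; functions that merely read it do not need it.
    global _product_list
    if _product_list is None:
        # The product list only needs to be computed once, then reused.
        _product_list = sorted(fetch_products())
    return _product_list


# Keep argument ordering consistent between related functions,
# e.g. always (index, product, ...) rather than mixing orders.
def product_exists(index, product):
    return product in load_product_list(index.list_products)


def archive_product(index, product, dry_run=True):
    if not product_exists(index, product):
        raise ValueError(f"unknown product: {product}")
    return f"would archive {product}" if dry_run else f"archived {product}"
```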
Hi Santosh, I think this is looking much better. I think there are still a few issues with the changes in the tests.
Also, please try to list all the changes in a PR, for example for this PR:
- Fixed `dea-duplicates`
  - Required changing the setup in `collections.py` for unique fields
  - Updated the tests since the CSV output depends on the unique fields
- Added NBART scenes to `collections.py`
- Fixed the output directory for the `dea-clean archived` tool
- Extend `dea-coherence` to check downstream datasets
Incorporated all the requested changes
Reason for this pull request
Datasets within the production database had multiple derived datasets/duplicate datasets derived from a source dataset whose `local_uri` is set to `None`/`Not Available`. This was causing ingestion failures during the orchestration process. We need a new tool to manage this scenario and take appropriate action.

Background
The above-mentioned scenario could arise if we delete a file from disk and run the `sync` tool with the `--update-location` option. Dataset locations within the database are then set to `None`/`Not Available` by the `sync` tool. So, during the `ingest` process, a query from the datacube results in an `index out of range` error and storage file creation fails.
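To make the failure mode concrete, here is a minimal sketch against the Open Data Cube API; the product name is hypothetical and the code is illustrative only, not part of this PR:

```python
# Illustrative only: how a locationless dataset surfaces during ingestion.
import datacube

dc = datacube.Datacube(app="locationless-example")

# Hypothetical product name, used purely for the example.
for dataset in dc.index.datasets.search(product="ls8_nbar_scene"):
    if not dataset.uris:
        # After running the sync tool with --update-location on a deleted
        # file, the dataset has no locations left. Code that assumes a
        # location (e.g. `dataset.uris[0]` or a usable `local_uri`) then
        # fails with an index-out-of-range error and storage file
        # creation aborts.
        print(f"locationless dataset: {dataset.id}")
        continue

    path = dataset.local_path  # only safe when a location exists
```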
Proposed solutions

Following are the changes implemented as part of this PR:

- Fixed `dea-duplicates`
  - Required changing the setup in `collections.py` for unique fields
  - Updated the tests since the `CSV` output depends on the unique fields
- Added `NBART` scenes to `collections.py`
- Fixed the output directory for the `dea-clean archived` tool
- … downstream datasets whose source dataset has no location
- Extend `dea-coherence` to check downstream datasets (see the sketch after this list)
- … `CSV` file
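A minimal sketch of what such a downstream check might look like; it assumes the ODC index methods `datasets.search()` and `datasets.get_derived()` plus the `Dataset.uris` attribute, and the product and CSV names are hypothetical, so this is not the actual `dea-coherence` implementation:

```python
# Illustrative sketch only; not the actual dea-coherence code.
import csv
import datacube

dc = datacube.Datacube(app="locationless-downstream-check")

with open("locationless_downstream.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["source_id", "derived_id"])

    # Hypothetical source product, used only for the example.
    for source in dc.index.datasets.search(product="ls8_level1_scene"):
        if source.uris:
            continue  # the source still has a location; nothing to manage

        # The source dataset is locationless: record every downstream
        # (derived) dataset so it can be archived or re-ingested later.
        for derived in dc.index.datasets.get_derived(source.id):
            writer.writerow([source.id, derived.id])
```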