Manage downstream datasets with locationless source dataset #178
base: develop
Conversation
Codecov Report
@@            Coverage Diff             @@
##           develop     #178     +/-  ##
===========================================
+ Coverage     66.8%    68.42%    +1.61%
===========================================
  Files           42        42
  Lines         3190      3230       +40
===========================================
+ Hits          2131      2210       +79
+ Misses        1059      1020       -39
Continue to review full report at Codecov.
Hey Santosh, thanks for getting this working.
There are a few things here that it'd be good to clean up before merging. Some are really easy fixes, but some of the logic is also getting hard to reason about, and we should sit down and work through it together to see if we can simplify it.

- `global` is only required when assigning to a global variable
- The product list only needs to be computed once
- Change argument ordering to be consistent between functions
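To illustrate the first two points, here is a minimal sketch; the function, variable, and index names are hypothetical and not taken from this PR's code:

```python
# Hypothetical sketch of the review points above; none of these names
# come from the PR itself.
_product_list = None  # module-level cache, computed once


def load_product_list(fetch_products):
    # `global` is only required because this function *assigns* to the
    # module-level name; functions that merely read it do not need it.
    global _product_list
    if _product_list is None:
        # The product list only needs to be computed once, then reused.
        _product_list = sorted(fetch_products())
    return _product_list


# Keep argument ordering consistent between related functions,
# e.g. always (index, product, ...) rather than mixing orders.
def product_exists(index, product):
    return product in load_product_list(index.list_products)


def archive_product(index, product, dry_run=True):
    if not product_exists(index, product):
        raise ValueError(f"unknown product: {product}")
    return f"would archive {product}" if dry_run else f"archived {product}"
```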
Hi Santosh, I think this is looking much better. I think there are still a few issues with the changes in the tests.
Also, please try to list all the changes in a PR, for example for this PR:
- Fixed `dea-duplicates`
  - Required changing the setup in `collections.py` for unique fields
  - Updated the tests since the CSV output depends on the unique fields
- Added NBART scenes to `collections.py`
- Fixed the output directory for the `dea-clean archived` tool
- Extend `dea-coherence` to check downstream datasets
Incorporated all the requested changes
Reason for this pull request
Datasets within the production database had multiple derived datasets/duplicate datasets derived from a source dataset whose `local_uri` is set to `None`/`Not Available`. This was causing ingestion failures during the orchestration process. We need a new tool to manage this scenario and take appropriate action.

Background
The above-mentioned scenario could arise if we delete a file from disk and run the `sync` tool with the `--update-location` option. Dataset locations within the database are then set to `None`/`Not Available` by the `sync` tool. So, during the `ingest` process, a query from the datacube results in an `index out of range` error and storage file creation fails.
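To make the failure mode concrete, here is a minimal sketch against the Open Data Cube API; the product name is hypothetical and the code is illustrative only, not part of this PR:

```python
# Illustrative only: how a locationless dataset surfaces during ingestion.
import datacube

dc = datacube.Datacube(app="locationless-example")

# Hypothetical product name, used purely for the example.
for dataset in dc.index.datasets.search(product="ls8_nbar_scene"):
    if not dataset.uris:
        # After running the sync tool with --update-location on a deleted
        # file, the dataset has no locations left. Code that assumes a
        # location (e.g. `dataset.uris[0]` or a usable `local_uri`) then
        # fails with an index-out-of-range error and storage file
        # creation aborts.
        print(f"locationless dataset: {dataset.id}")
        continue

    path = dataset.local_path  # only safe when a location exists
```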
Proposed solutions

Following are the changes implemented as part of this PR:

- Fixed `dea-duplicates`
  - Required changing the setup in `collections.py` for unique fields
  - Updated the tests since the `CSV` output depends on the unique fields
- Added `NBART` scenes to `collections.py`
- Fixed the output directory for the `dea-clean archived` tool
- … downstream datasets whose source dataset has no location
- Extend `dea-coherence` to check downstream datasets (see the sketch after this list)
- … `CSV` file
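A minimal sketch of what such a downstream check might look like; it assumes the ODC index methods `datasets.search()` and `datasets.get_derived()` plus the `Dataset.uris` attribute, and the product and CSV names are hypothetical, so this is not the actual `dea-coherence` implementation:

```python
# Illustrative sketch only; not the actual dea-coherence code.
import csv
import datacube

dc = datacube.Datacube(app="locationless-downstream-check")

with open("locationless_downstream.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["source_id", "derived_id"])

    # Hypothetical source product, used only for the example.
    for source in dc.index.datasets.search(product="ls8_level1_scene"):
        if source.uris:
            continue  # the source still has a location; nothing to manage

        # The source dataset is locationless: record every downstream
        # (derived) dataset so it can be archived or re-ingested later.
        for derived in dc.index.datasets.get_derived(source.id):
            writer.writerow([source.id, derived.id])
```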