Dataset CLI tool to find duplicates #1517
As discussed on Teams: You have in sql common_timestamp returns a I think we need to explicitly convert to UTC, probably with the AT TIME ZONE operator. (Or to the system time zone? But that would complicate testing.) (Classic gotcha - virtually never what any real world user would actually want, and contrary to the SQL standard.)
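To illustrate why the explicit UTC conversion matters, here is a minimal Python sketch of the same idea. It is not the SQL used in this PR: `to_utc` is a hypothetical helper, and treating naive timestamps as already being in UTC is an assumption made only for this example.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Normalise a timestamp to UTC.

    Assumption for this sketch: a naive timestamp is taken to already be UTC.
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


# The same instant written with two different offsets.
aest = timezone(timedelta(hours=10))
local = datetime(2023, 11, 1, 10, 0, tzinfo=aest)
utc = datetime(2023, 11, 1, 0, 0, tzinfo=timezone.utc)

# After normalisation both render identically, so grouping on the
# rendered value no longer depends on the session or system time zone.
normalised = {to_utc(local).isoformat(), to_utc(utc).isoformat()}
```

Converting to a fixed zone rather than the system zone also keeps test expectations independent of the machine they run on, which is the testing concern raised in the comment.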
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #1517      +/-   ##
===========================================
+ Coverage    91.75%   91.79%   +0.03%
===========================================
  Files          132      132
  Lines        14552    14617      +65
===========================================
+ Hits         13352    13417      +65
  Misses        1200     1200
I'm going to be a pain and ask you to update the memory index driver implementation as well - should be heaps easier than the ones that use a database.
* add dataset cli tool to find duplicates
* convert timestamps to UTC
* update whats_new
* update memory driver search duplicates implementation

---------

Co-authored-by: Ariana Barzinpour <[email protected]>
* Update whats_new.rst for 1.8.17 release. (#1510)
* [pre-commit.ci] pre-commit autoupdate
  updates:
  - [github.com/adrienverge/yamllint.git: v1.32.0 → v1.33.0](https://github.com/adrienverge/yamllint.git/compare/v1.32.0...v1.33.0)
* Bump conda-incubator/setup-miniconda from 2 to 3
  Bumps [conda-incubator/setup-miniconda](https://github.com/conda-incubator/setup-miniconda) from 2 to 3.
  - [Release notes](https://github.com/conda-incubator/setup-miniconda/releases)
  - [Changelog](https://github.com/conda-incubator/setup-miniconda/blob/main/CHANGELOG.md)
  - [Commits](conda-incubator/setup-miniconda@v2...v3)
  ---
  updated-dependencies:
  - dependency-name: conda-incubator/setup-miniconda
    dependency-type: direct:production
    update-type: version-update:semver-major
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* Dataset CLI tool to find duplicates (#1517)
  * add dataset cli tool to find duplicates
  * convert timestamps to UTC
  * update whats_new
  * update memory driver search duplicates implementation
  ---------
  Co-authored-by: Ariana Barzinpour <[email protected]>
* Make solar_date() timezone aware. (#1521)
* Warn if non-eo3 dataset has eo3 metadata type (#1523)
  * warn if non-eo3 dataset has eo3 metadata type
  * fix str contains
  * add test
  ---------
  Co-authored-by: Ariana Barzinpour <[email protected]>
* Fix merge oops
* Resolve some merge issues arising from differences between SQLAlchemy 1.4 and 2.0.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ariana-B <[email protected]>
Co-authored-by: Ariana Barzinpour <[email protected]>
Reason for this pull request
Easily scan the database for duplicate indexed datasets, identifying duplicates by a set of specified field values, and produce a report on any datasets found.
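The idea of identifying duplicates by a set of field values can be pictured with a small in-memory sketch. This is not the implementation from this PR: `find_duplicates`, the dictionary layout, and the field names `region_code` and `time` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def find_duplicates(
    datasets: Iterable[Dict[str, Any]],
    fields: Sequence[str],
) -> Dict[Tuple[Any, ...], List[str]]:
    """Group dataset ids by the values of `fields`; keep groups with 2+ ids."""
    groups: Dict[Tuple[Any, ...], List[str]] = defaultdict(list)
    for ds in datasets:
        key = tuple(ds[f] for f in fields)  # the field values form the identity
        groups[key].append(ds["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records standing in for indexed datasets.
datasets = [
    {"id": "a", "region_code": "090084", "time": "2023-01-01"},
    {"id": "b", "region_code": "090084", "time": "2023-01-01"},
    {"id": "c", "region_code": "090085", "time": "2023-01-01"},
]

report = find_duplicates(datasets, ["region_code", "time"])
```

Here `report` maps each repeated combination of field values to the ids that share it, which is roughly the shape a duplicates report would need.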
Proposed changes
* New dataset CLI tool, find-duplicates, that takes in fields by which to search for duplicates and optional products in which to search
* Add expression_with_leniency property to DateRangeDocField to extend date range by ±500 ms

Closes #xxxx

* Tests added / passed
* Fully documented, including docs/about/whats_new.rst for all changes

📚 Documentation preview 📚: https://datacube-core--1517.org.readthedocs.build/en/1517/
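As a rough illustration of the 500 ms leniency on date ranges mentioned in the proposed changes, the sketch below widens a range on both ends before comparing it with another. How expression_with_leniency is actually built and used is not shown in this PR description, so `with_leniency` and `overlaps` are hypothetical helpers and the overlap-based comparison is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

TimeRange = Tuple[datetime, datetime]
LENIENCY = timedelta(milliseconds=500)


def with_leniency(rng: TimeRange, leniency: timedelta = LENIENCY) -> TimeRange:
    """Widen a (start, end) range by `leniency` on both ends."""
    start, end = rng
    return start - leniency, end + leniency


def overlaps(a: TimeRange, b: TimeRange) -> bool:
    """True if the two closed ranges share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


t0 = datetime(2023, 1, 1, tzinfo=timezone.utc)
first = (t0, t0 + timedelta(seconds=1))
# Starts 300 ms after `first` ends, e.g. because of rounding in metadata.
second = (t0 + timedelta(milliseconds=1300), t0 + timedelta(seconds=2))

strict_match = overlaps(first, second)
lenient_match = overlaps(with_leniency(first), second)
```

With these values the strict comparison misses the pair while the lenient one catches it, which is the kind of near-identical timestamp a fixed tolerance is meant to absorb.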