Dataset CLI tool to find duplicates #1517

Merged: 5 commits, Dec 4, 2023

@Ariana-B Ariana-B commented Nov 30, 2023

Reason for this pull request

Easily scan the database for duplicate indexed datasets, identifying duplicates by a set of specified field values, and produce a report on any datasets found.
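
As a rough illustration of what "identifying duplicates by a set of specified field values" means, here is a minimal pure-Python sketch. It is not the code from this pull request: the `find_duplicates` function, the dict-based records, and the field names are all hypothetical stand-ins for indexed datasets.

```python
from collections import defaultdict


def find_duplicates(datasets, fields):
    """Group dataset ids by the values of `fields`.

    Returns only the groups that contain more than one id.
    `datasets` is an iterable of plain dicts (a hypothetical stand-in
    for indexed datasets), each with an "id" key.
    """
    groups = defaultdict(list)
    for ds in datasets:
        # The tuple of field values acts as the duplicate-detection key.
        key = tuple(ds[f] for f in fields)
        groups[key].append(ds["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records.
sample = [
    {"id": "a", "product": "ls8", "region_code": "r1"},
    {"id": "b", "product": "ls8", "region_code": "r1"},
    {"id": "c", "product": "ls8", "region_code": "r2"},
]

duplicates = find_duplicates(sample, ["product", "region_code"])
```

With these sample records, only `a` and `b` share the same values for both fields, so they form the single reported group.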

Proposed changes

  • New dataset CLI tool, find-duplicates, that takes the fields by which to search for duplicates and, optionally, the products in which to search

  • Add expression_with_leniency property to DateRangeDocField to extend the date range by ±500 ms

  • Closes #xxxx

  • Tests added / passed

  • Fully documented, including docs/about/whats_new.rst for all changes
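
The ±500 ms leniency can be pictured with a small Python sketch. This is an assumption-laden illustration, not the `expression_with_leniency` implementation (which builds a database expression): the `widen` and `overlaps` helpers are hypothetical, and only the 500 ms figure comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# Leniency value taken from the PR description; everything else is illustrative.
LENIENCY = timedelta(milliseconds=500)


def widen(begin, end, leniency=LENIENCY):
    """Return the (begin, end) range extended by `leniency` at both ends."""
    return begin - leniency, end + leniency


def overlaps(a, b):
    """True if the closed ranges a and b share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


t0 = datetime(2023, 12, 1, tzinfo=timezone.utc)
first = (t0, t0 + timedelta(seconds=1))
# Starts 300 ms after `first` ends, so the two ranges do not strictly overlap.
second = (t0 + timedelta(seconds=1, milliseconds=300), t0 + timedelta(seconds=2))

strict_match = overlaps(first, second)
lenient_match = overlaps(widen(*first), second)
```

Here the strict comparison finds no overlap, while widening one range by 500 ms is enough to make the two ranges match.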


📚 Documentation preview 📚: https://datacube-core--1517.org.readthedocs.build/en/1517/

@SpacemanPaul

As discussed on Teams:

In the SQL you have agdc.common_timestamp( str-value-from-json )::timestamp

common_timestamp returns a timestamp with timezone; casting that to timestamp with ::timestamp then returns a timestamp WITHOUT timezone by IGNORING the timezone.

I think we need to explicitly convert to UTC, probably with the AT TIME ZONE operator. (Or to the system time zone? But that would complicate testing)

(Classic gotcha - virtually never what any real world user would actually want, and contrary to the SQL standard.)
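
The same class of problem can be shown in Python, as an analogy only: this sketch does not reproduce the SQL in the pull request or PostgreSQL's exact casting rules. It contrasts discarding the offset of a timezone-aware value with converting to UTC first; both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_naive_dropping_offset(dt):
    """Discard tzinfo without converting: the wall-clock digits are kept as-is."""
    return dt.replace(tzinfo=None)


def to_naive_utc(dt):
    """Convert to UTC first, then discard tzinfo."""
    return dt.astimezone(timezone.utc).replace(tzinfo=None)


# Two representations of the same instant, in UTC+10 and in UTC.
utc_plus_10 = timezone(timedelta(hours=10))
local = datetime(2023, 12, 1, 10, 0, tzinfo=utc_plus_10)
utc = datetime(2023, 12, 1, 0, 0, tzinfo=timezone.utc)

same_instant = local == utc
dropped_equal = to_naive_dropping_offset(local) == to_naive_dropping_offset(utc)
converted_equal = to_naive_utc(local) == to_naive_utc(utc)
```

The two aware values compare equal, but after the offset is simply dropped they differ by ten hours. Converting to UTC before dropping the offset keeps them equal, which is the property a duplicate search on timestamps depends on.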

codecov bot commented Dec 1, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (9fb8c8c) 91.75% compared to head (94d6a5b) 91.79%.
Report is 2 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| datacube/index/memory/_datasets.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1517      +/-   ##
===========================================
+ Coverage    91.75%   91.79%   +0.03%     
===========================================
  Files          132      132              
  Lines        14552    14617      +65     
===========================================
+ Hits         13352    13417      +65     
  Misses        1200     1200              


@Ariana-B Ariana-B marked this pull request as ready for review December 1, 2023 04:14

@SpacemanPaul SpacemanPaul left a comment

I'm going to be a pain and ask you to update the memory index driver implementation as well - should be heaps easier than the ones that use a database.

@Ariana-B Ariana-B requested a review from SpacemanPaul December 4, 2023 01:24
@SpacemanPaul SpacemanPaul merged commit 7c738fd into develop Dec 4, 2023
31 checks passed
@SpacemanPaul SpacemanPaul deleted the find_duplicate_indexed branch December 4, 2023 04:25
SpacemanPaul pushed a commit that referenced this pull request Dec 19, 2023
* add dataset cli tool to find duplicates

* convert timestamps to UTC

* update whats_new

* update memory driver search duplicates implementation

---------

Co-authored-by: Ariana Barzinpour <[email protected]>
SpacemanPaul added a commit that referenced this pull request Dec 20, 2023
* Update whats_new.rst for 1.8.17 release. (#1510)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/adrienverge/yamllint.git: v1.32.0 → v1.33.0](https://github.com/adrienverge/yamllint.git/compare/v1.32.0...v1.33.0)

* Bump conda-incubator/setup-miniconda from 2 to 3

Bumps [conda-incubator/setup-miniconda](https://github.com/conda-incubator/setup-miniconda) from 2 to 3.
- [Release notes](https://github.com/conda-incubator/setup-miniconda/releases)
- [Changelog](https://github.com/conda-incubator/setup-miniconda/blob/main/CHANGELOG.md)
- [Commits](conda-incubator/setup-miniconda@v2...v3)

---
updated-dependencies:
- dependency-name: conda-incubator/setup-miniconda
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

* Dataset CLI tool to find duplicates (#1517)

* add dataset cli tool to find duplicates

* convert timestamps to UTC

* update whats_new

* update memory driver search duplicates implementation

---------

Co-authored-by: Ariana Barzinpour <[email protected]>

* Make solar_date() timezone aware. (#1521)

* Warn if non-eo3 dataset has eo3 metadata type (#1523)

* warn if non-eo3 dataset has eo3 metadata type

* fix str contains

* add test

---------

Co-authored-by: Ariana Barzinpour <[email protected]>

* Fix merge oops

* Resolve some merge issues arising from differences between SQLAlchemy 1.4 and 2.0.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ariana-B <[email protected]>
Co-authored-by: Ariana Barzinpour <[email protected]>