
Out of Sync Explorer Tables + Crashes #581

Open
omad opened this issue Apr 5, 2024 · 13 comments

Comments

@omad
Member

omad commented Apr 5, 2024

While managing the DEA ODC Databases, we've had to do some manual deletions and updates to the agdc database tables, and I suspect this has made Explorer unhappy.

We've used a 'QA' SQL query in the past, which compares the number of datasets explorer knows about vs the number of datasets ODC knows about.

SELECT *, (output_query.agdc_ds_count - output_query.explorer_ds_count) AS diff
FROM (
    SELECT count(ds.id)    AS agdc_ds_count,
           p.dataset_count AS explorer_ds_count,
           dt.id,
           p.name
    FROM cubedash.product p
    JOIN agdc.dataset_type dt ON dt.name = p.name
    JOIN agdc.dataset ds      ON dt.id = ds.dataset_type_ref
    WHERE ds.archived IS NULL
    GROUP BY p.name, dt.id, p.dataset_count
) AS output_query
WHERE output_query.agdc_ds_count <> output_query.explorer_ds_count;

Which shows a bunch of mismatched Products:

 agdc_ds_count | explorer_ds_count |  id  |            name            |  diff   
---------------+-------------------+------+----------------------------+---------
        419777 |            514922 |  479 | ga_ls_fc_3                 |  -95145
        415373 |            417815 |  478 | ga_ls_wo_3                 |   -2442
         39227 |             88696 | 1271 | ga_ls_wo_fq_cyear_3        |  -49469
        503454 |            503453 | 1798 | ga_s2am_ard_3              |       1
         65063 |            184167 | 1006 | ga_s2_ba_provisional_3     | -119104
          8639 |             10050 | 2953 | ga_s2ls_intertidal_cyear_3 |   -1411
         51381 |            204223 | 1600 | ga_s2_wo_provisional_3     | -152842

In an attempt to fix this, I thought I could run cubedash-gen --force-refresh on each of the out-of-sync products; however, when I try this, I get a crash:

cubedash-gen --force-refresh ga_s2ls_intertidal_cyear_3
Updating 1 products for LocalConfig<loaded_from=defaults, environment='default', config={'db_hostname': 'dea-dev-db', 'db_database': 'odc', 'db_username': 'explorer_writer', 'db_password': '***', 'index_driver': 'default', 'db_connection_timeout': '60'}>
Generating product summaries...
ga_s2ls_intertidal_cyear_3 refresh
2024-04-05T04:27:02.288888Z [warning  ] spatial_deletion_full_scan     after_date=None product_name=ga_s2ls_intertidal_cyear_3
2024-04-05T04:27:02.326885Z [warning  ] spatial_update.recreating_everything after_date=None product_name=ga_s2ls_intertidal_cyear_3
2024-04-05T04:33:35.828477Z [warning  ] forcing_refresh                product_name=ga_s2ls_intertidal_cyear_3
	 ga_s2ls_intertidal_cyear_3 1900
2024-04-05T04:33:35.996976Z [error    ] product.error                  product=ga_s2ls_intertidal_cyear_3
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cubedash/generate.py", line 127, in generate_report
    result, updated_summary = store.refresh(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_stores.py", line 1528, in refresh
    self._recalculate_period(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_stores.py", line 1374, in _recalculate_period
    summary = self._summariser.calculate_summary(
  File "/usr/local/lib/python3.10/dist-packages/cubedash/summary/_summarise.py", line 170, in calculate_summary
    dict(
TypeError: 'LegacyCursorResult' object is not subscriptable
ga_s2ls_intertidal_cyear_3 error
finished. 1 error

This did seem to succeed in getting the dataset counts in sync, but I'm not sure what state it has left the database in.

Questions

  • Is this crash happening after the Product is entirely updated, or is it only partially done?
  • Does this SQL query make any sense?
  • Should a 'QA' feature like this be included in Explorer?
  • What can we do to make Explorer resilient to database modifications? Ideally we need to fix ODC to support mutating the database, but there are a few assumptions about datasets never being deleted that will have flow-on impacts.

Versions

Docker Image: docker.io/opendatacube/explorer@sha256:026ac9f06ad41abdf62dd5c8dc3206f6595bc25564f029bbcec574e64806d317
Explorer version: 2.10.2.dev20+ge0aa69f3
Core version: 1.8.17
Python: 3.10.12
SQLAlchemy Version: 1.4.52

@jeremyh
Collaborator

jeremyh commented Apr 5, 2024

Interesting that the error is coming from a "LegacyCursorResult" object; I wonder if SQLAlchemy has changed.

@omad
Member Author

omad commented Apr 5, 2024

That's what it sounds like to me. Or ODC has changed...

@Ariana-B or @SpacemanPaul: I know there's been work on updating to SQLAlchemy 2 in ODC. Is that only in ODC 1.9+, or was it in 1.8 as well?

@Ariana-B
Contributor

Ariana-B commented Apr 5, 2024

I think this might even be a SQLAlchemy 1.4 thing (which I believe explorer does use): https://stackoverflow.com/a/58660606
And from the docs:

Within the scope of the 1.x series of SQLAlchemy, Core SQL results in version 1.4 return an instance of LegacyCursorResult which takes the place of the CursorResult class used for the 1.3 series and previously. This object returns rows as LegacyRow objects, which maintains Python mapping (i.e. dictionary) like behaviors upon the object itself. Going forward, the Row._mapping attribute should be used for dictionary behaviors.

@jeremyh
Collaborator

jeremyh commented Apr 5, 2024

I agree that force-refresh is the correct move here — it should replace everything Explorer knows about the product. Will look into the crash, but to answer the other Qs:

What can we do to make Explorer resilient to Database modifications? Ideally we need to fix ODC to support mutating the database, but there's going to be a few assumptions about datasets not being allowed to be deleted that will have flow on impacts.

The issue is that ODC's tables let us efficiently see recently added/updated/archived rows in the ODC schema, but not deleted rows (as they disappear completely).

To handle deletions efficiently (ie, incrementally, without scanning the whole table) I believe we'd need to extend ODC's tables to add a "deleted row log" table that's populated via a trigger. Explorer could do that if we really want it.

The "ideal" way to do it at the moment, without code changes, is to archive the datasets first, run cubedash-gen so it sees those datasets as recently archived, and then we're free to delete them from ODC. This was never seen as a high priority because we used to say prod datasets would only ever be archived (so the provenance record remains), but that's idealistic :)
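The deleted-row-log idea above could look something like this. A minimal sketch using stdlib sqlite3 for illustration; the real thing would be a PostgreSQL trigger on ODC's agdc.dataset table, and all table and column names here are hypothetical, not ODC's actual schema:

```python
# Sketch: record deletions in a log table via a trigger, so an
# incremental refresh can see which rows disappeared.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id TEXT PRIMARY KEY, product TEXT);

CREATE TABLE deleted_dataset_log (
    id TEXT,
    product TEXT,
    deleted_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Populated automatically on every delete from the dataset table.
CREATE TRIGGER log_dataset_delete AFTER DELETE ON dataset
BEGIN
    INSERT INTO deleted_dataset_log (id, product)
    VALUES (old.id, old.product);
END;
""")

conn.execute("INSERT INTO dataset VALUES ('ds-1', 'ga_ls_wo_3')")
conn.execute("DELETE FROM dataset WHERE id = 'ds-1'")

rows = conn.execute("SELECT id, product FROM deleted_dataset_log").fetchall()
print(rows)  # [('ds-1', 'ga_ls_wo_3')]
```

Explorer's refresh could then scan deleted_dataset_log since its last run instead of a full-table scan.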

@jeremyh
Collaborator

jeremyh commented Apr 5, 2024

Does this SQL query make any sense?
Should a 'QA' feature like this be included in Explorer?

It does, though it's a broad brush. The raw product count is one of roughly three checks. A more thorough query could be put on the audit pages of Explorer: checking, for instance, the spatial table counts (which should show deletions when a dataset is removed/archived), the combined year counts, and how recent the summary is compared to the most recently changed datasets in the underlying ODC.

@SpacemanPaul

SQLAlchemy 2 is only in datacube-1.9.

1.8 should still be on SQLAlchemy 1.4. But the later 1.4.x versions have a lot of changes to help users prepare for the migration to 2.0, so it might be one of those.

@pjonsson
Contributor

pjonsson commented Apr 6, 2024

I don't know if it is related to this, but after fixing the missing lxml_html_clean dependency (#584), many tests on current HEAD of the develop branch also crash with a TypeError on LegacyCursorResult:

...
            region_counts = Counter(
>               dict(
                    self._engine.execute(
                        select(
                            [
                                DATASET_SPATIAL.c.region_code.label("region_code"),
                                func.count(),
                            ]
                        )
                        .where(where_clause)
                        .group_by("region_code")
                    )
                )
            )
E           TypeError: 'LegacyCursorResult' object is not subscriptable
cubedash/summary/_summarise.py:170: TypeError
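A plausible stdlib-only reconstruction of why dict() blows up here: dict() prefers the mapping protocol, so when it sees an object with a .keys() method (as SQLAlchemy 1.4's cursor results have) it calls keys() and then tries obj[key], which the result object doesn't support. FakeLegacyCursorResult below is a hypothetical stand-in, not the real SQLAlchemy class:

```python
class FakeLegacyCursorResult:
    """Toy stand-in for SQLAlchemy 1.4's LegacyCursorResult."""

    def __init__(self, rows):
        self._rows = rows

    def keys(self):
        # Column names, like a real cursor result.
        return ["region_code", "count"]

    def __iter__(self):
        # Yields tuple-like rows.
        return iter(self._rows)

    # Note: no __getitem__, so the object is not subscriptable.


result = FakeLegacyCursorResult([("x1", 3), ("y2", 5)])

try:
    # dict() sees .keys(), takes the mapping path, then calls result[key].
    dict(result)
except TypeError as e:
    print(e)  # 'FakeLegacyCursorResult' object is not subscriptable

# Workaround sketch: force the sequence-of-pairs path by iterating rows.
counts = dict(iter(result))
print(counts)  # {'x1': 3, 'y2': 5}
```

This matches the traceback's "'LegacyCursorResult' object is not subscriptable" shape, though the real fix in Explorer may differ.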

@pjonsson
Contributor

pjonsson commented Apr 6, 2024

The LegacyCursorResult type errors were first seen on develop after @omad merged #580 on 18 March, and it looks like commit e0aa69f introduced the error (commit a6c1079 just before was also red, but the integration tests passed for that).

@pjonsson
Contributor

pjonsson commented Apr 6, 2024

Had a second look; #580 (comment) isn't obvious to me.

@jeremyh
Collaborator

jeremyh commented Apr 7, 2024

The LegacyCursorResult type errors were first seen on develop after @omad merged #580 on 18 March, and it looks like commit e0aa69f introduced the error (commit a6c1079 just before was also red, but the integration tests passed for that).

Good catch @pjonsson!

That auto-modernise does break it. Here's a simple reproduction that throws the same exception:

    from sqlalchemy import create_engine, select, func
    from collections import Counter

    engine = create_engine('sqlite:///:memory:')
    result2 = Counter(
        dict(
            engine.execute(
                select(
                    [
                        select(1).label("item"),
                        select(2).label("count")
                    ]
                )
            )
        )
    )

But that's not the only bad part of the auto-modernise: it's also changing the == None lines in SQLAlchemy queries to is None.

They have different semantics, as the is operator can never be overridden to be part of SQLAlchemy's API (but __eq__ can). This could drastically change some logic.

  • (On newer SQLAlchemy there's also a more-explicit .is_(None), which didn't exist when Explorer was first written.)
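The == vs is distinction can be shown with a toy class; this is an illustrative stand-in, not SQLAlchemy's actual Column implementation:

```python
class Column:
    """Toy query-builder column: overloads __eq__ to build SQL text."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # A real query builder returns an expression object here;
        # a string is enough to show the mechanism.
        if other is None:
            return f"{self.name} IS NULL"
        return f"{self.name} = {other!r}"


archived = Column("archived")

print(archived == None)  # 'archived IS NULL' -- builds a SQL fragment
print(archived is None)  # False -- plain Python identity check, no SQL
```

So a lint-driven rewrite of == None to is None silently turns a SQL filter into the constant False.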

@jeremyh
Collaborator

jeremyh commented Apr 7, 2024

This might be a good excuse to move from shed over to ruff, as we have with other repositories. Formatting is fine and we don't need the "modernisation" aspects.

jeremyh added a commit to jeremyh/datacube-explorer that referenced this issue Apr 8, 2024
@jeremyh
Collaborator

jeremyh commented Apr 8, 2024

Reverting those changes in #587

omad pushed a commit that referenced this issue Apr 8, 2024
Based on issues in #581
@pjonsson
Contributor

pjonsson commented Apr 8, 2024

I also just switched to Ruff from black/isort/flake8 and have good experiences so far.

I don't know anything about shed, but I know Ruff lint has a modernisation category that can upgrade various things, because it was eager to switch from Optional[t] to t | None on Python 3.10.
