Split series and spatial DAGs #66

Closed
wants to merge 4 commits into from

Conversation

@Lun4m (Collaborator) commented Apr 29, 2024

  • Are both series and spatial tests required? If so, we can drop the Option I used in ScheduleDag.
    Otherwise, we need to add a check before calling the validate methods to make sure they are not None
    (both alternatives are sketched after this list), or, when constructing the subdag, add something like

    // Error out early, via the proposed Error::DagIsNone variant, if the spatial DAG was never constructed
    let subdag = Scheduler::construct_subdag(
        self.dag
            .spatial
            .as_ref()
            .ok_or(Error::DagIsNone)?,
        tests,
    )?;
  • Also not sure about the best name and place for ScheduleDag.
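
To make the two alternatives in the first bullet concrete, here is a minimal sketch; Dag is a placeholder standing in for the real DAG type, and the struct names are just illustrative:

    // Placeholder for the real DAG type.
    struct Dag;

    // Alternative A: both series and spatial tests are always required, so the
    // fields are plain Dags and the validate methods can use them directly.
    struct ScheduleDagRequired {
        series: Dag,
        spatial: Dag,
    }

    // Alternative B: either DAG may be absent, so the fields stay wrapped in
    // Option and every caller has to handle None, e.g. via the Error::DagIsNone
    // variant from the snippet above.
    struct ScheduleDagOptional {
        series: Option<Dag>,
        spatial: Option<Dag>,
    }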

@intarga (Member) commented Apr 29, 2024

So, as discussed in our call, the distinction should really be "fresh data" DAG vs periodic DAG, and thinking about it some more, we don't necessarily need two separate DAG objects; it's probably fine to have one object with two disjoint graphs in it (see the sketch below).
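
To illustrate the single-object idea, here is a rough sketch; all names are hypothetical, not rove's actual API. The point is one Dag value whose node set contains two disjoint subgraphs, distinguished by a schedule tag:

    #[derive(Clone, Copy, PartialEq)]
    enum Schedule {
        Fresh,    // tests runnable as soon as data arrives
        Periodic, // tests that have to wait (future obs, derived series, ...)
    }

    struct Node {
        test_name: &'static str,
        schedule: Schedule,
        children: Vec<usize>, // indices into `nodes`; edges never cross schedules
    }

    struct Dag {
        nodes: Vec<Node>,
    }

    impl Dag {
        // Iterate over the nodes of one of the two disjoint subgraphs.
        fn subgraph(&self, schedule: Schedule) -> impl Iterator<Item = &Node> {
            self.nodes.iter().filter(move |n| n.schedule == schedule)
        }
    }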

What I've really realised looking at this now, though, is that this work is much more tightly coupled to a bigger refactor I have planned than I thought.

When I started working on this, I didn't have any clear answers on what tests would need to be run and how they would relate to each other, so I had to make some assumptions and move forward based on those. Now that I've finally been given a list of timeseries tests to replicate and some prospective pipelines, a bunch of cases that invalidate my assumptions have come up:

  1. The distinction between "fresh data" and periodic does not cleanly map onto timeseries vs spatial. While I still think it's true that spatial tests can't be done on fresh data, some timeseries tests also cannot: namely, "dip_check" requires an observation from the future, and min_gt (a check between related climate params) often requires derivative timeseries, which cannot be produced until later.
  2. Somewhat related to (1), I was under the impression that timeseries tests operate on a single timeseries. This is untrue, as tests like min_gt require multiple timeseries.
  3. I was under the impression that tests that depend on other tests would simply accept the flag information from previous tests (though even this currently isn't kept around as state) and handle it themselves, but it now seems that we need to do some active filtering on the rove side to remove flagged observations from the data set before putting it into tests (see the sketch after this list).
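
For (3), the filtering could look something like this minimal sketch; Obs and Flag are placeholder types, not rove's actual structures:

    #[derive(Clone, Copy, PartialEq)]
    enum Flag {
        Pass,
        Fail,
    }

    struct Obs {
        value: f32,
        flag: Flag,
    }

    // Drop observations that an earlier test in the DAG flagged as bad,
    // before handing the data set to a dependent test.
    fn filter_flagged(data: Vec<Obs>) -> Vec<Obs> {
        data.into_iter().filter(|obs| obs.flag == Flag::Pass).collect()
    }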

The ultimate end goal of the big refactor (I think) is to end up with API endpoints validate_fresh and validate_periodic instead of validate_series and validate_spatial, but it's going to be a lot of work to get there, so I suggest we break it down a bit:

Firstly, the "fresh data" case seems like a simplified special case of the periodic one, so I suggest we merge validate_spatial and validate_series into validate_periodic first, and tackle validate_fresh later.

The reason the API endpoints are separate to begin with is the different shape of spatial vs series data (see the SeriesCache and SpatialCache structs), and the different shape of the specifiers required to get them from a data source (i.e. in the case of frost you need station_id, element_id, start_time, and end_time to get series data, but element_id, timestamp, and polygon to get spatial data).
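
Roughly, the two specifier shapes look like this (field types are simplified placeholders; frost's real identifiers may differ):

    struct SeriesSpec {
        station_id: u32,
        element_id: String,
        start_time: i64, // unix timestamps, for simplicity
        end_time: i64,
    }

    struct SpatialSpec {
        element_id: String,
        timestamp: i64,
        polygon: Vec<(f64, f64)>, // lat/lon vertices
    }

    // A unified endpoint would presumably take something like:
    enum DataSpec {
        Series(SeriesSpec),
        Spatial(SpatialSpec),
    }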

Given that, I propose these steps:

  1. We try to unify/reconcile SeriesCache and SpatialCache into a single data structure that can be indexed appropriately for both kinds of test (see the sketch after this list). The hard part I foresee here is handling cases where timeseries have different time resolutions (e.g. 5-minute data and 10-minute data). I think this makes sense to tackle first, since it has no dependencies and it's the most likely to go wrong and require a rethink.
  2. Once/if we've managed that, we change the DataConnector trait and its implementors to have only the one method.
  3. We merge validate_spatial and validate_series, and change the proto file to reflect this.
  4. Tackle filtering out flagged data from tests.
  5. Tackle validate_fresh.
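
As a starting point for (1) and (2), here is a speculative sketch of a unified cache that can be indexed both along time (series-style) and across stations (spatial-style), plus a DataConnector trait reduced to a single method. Everything here is hypothetical; the real SeriesCache/SpatialCache look different, and this only sidesteps the mixed-time-resolution problem by letting each series carry its own timestamps:

    use std::collections::BTreeMap;

    type Timeseries = BTreeMap<i64, f32>; // timestamp -> value

    struct DataCache {
        // One series per (station, element) pair. Differing time resolutions
        // are tolerated because each series carries its own timestamps.
        series: BTreeMap<(u32, String), Timeseries>,
    }

    impl DataCache {
        // Series-style access: one station's values over a time window.
        fn window(&self, key: &(u32, String), from: i64, to: i64) -> Vec<f32> {
            self.series
                .get(key)
                .map(|ts| ts.range(from..=to).map(|(_, v)| *v).collect())
                .unwrap_or_default()
        }

        // Spatial-style access: every station's value at one timestamp.
        fn slice(&self, element_id: &str, time: i64) -> Vec<(u32, f32)> {
            self.series
                .iter()
                .filter(|((_, elem), _)| elem == element_id)
                .filter_map(|((station, _), ts)| ts.get(&time).map(|v| (*station, *v)))
                .collect()
        }
    }

    // The unified DataSpec from the earlier sketch (variants elided here).
    enum DataSpec {}

    // Step (2): the DataConnector trait collapses to one method.
    trait DataConnector {
        fn fetch_data(&self, spec: DataSpec) -> Result<DataCache, String>;
    }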

We can discuss this further on video tomorrow.

@Lun4m mentioned this pull request May 6, 2024
@intarga closed this May 6, 2024