Split series and spatial DAGs #66

Closed
wants to merge 4 commits into from

Conversation

@Lun4m (Collaborator) commented Apr 29, 2024

  • Are both series and spatial tests required? If so, we can drop the Option I used in ScheduleDag.
    Otherwise, we need to add a check before calling the validate methods to make sure they are not None
    (both alternatives are sketched after this list), or, when constructing the subdag, add something like

    // Error out early, via the proposed Error::DagIsNone variant, if the spatial DAG was never constructed
    let subdag = Scheduler::construct_subdag(
        self.dag
            .spatial
            .as_ref()
            .ok_or(Error::DagIsNone)?,
        tests,
    )?;
  • Also not sure about the best name and place for ScheduleDag.
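
To make the two alternatives in the first bullet concrete, here is a minimal sketch; Dag is a placeholder standing in for the real DAG type, and the struct names are just illustrative:

    // Placeholder for the real DAG type.
    struct Dag;

    // Alternative A: both series and spatial tests are always required, so the
    // fields are plain Dags and the validate methods can use them directly.
    struct ScheduleDagRequired {
        series: Dag,
        spatial: Dag,
    }

    // Alternative B: either DAG may be absent, so the fields stay wrapped in
    // Option and every caller has to handle None, e.g. via the Error::DagIsNone
    // variant from the snippet above.
    struct ScheduleDagOptional {
        series: Option<Dag>,
        spatial: Option<Dag>,
    }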

@intarga (Member) commented Apr 29, 2024

So, as discussed in our call, the distinction should really be "fresh data" DAG vs periodic DAG, and thinking about it some more, we don't necessarily need two separate DAG objects; it's probably fine to have one object with two disjoint graphs in it (see the sketch below).
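
To illustrate the single-object idea, here is a rough sketch; all names are hypothetical, not rove's actual API. The point is one Dag value whose node set contains two disjoint subgraphs, distinguished by a schedule tag:

    #[derive(Clone, Copy, PartialEq)]
    enum Schedule {
        Fresh,    // tests runnable as soon as data arrives
        Periodic, // tests that have to wait (future obs, derived series, ...)
    }

    struct Node {
        test_name: &'static str,
        schedule: Schedule,
        children: Vec<usize>, // indices into `nodes`; edges never cross schedules
    }

    struct Dag {
        nodes: Vec<Node>,
    }

    impl Dag {
        // Iterate over the nodes of one of the two disjoint subgraphs.
        fn subgraph(&self, schedule: Schedule) -> impl Iterator<Item = &Node> {
            self.nodes.iter().filter(move |n| n.schedule == schedule)
        }
    }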

What I've really realised looking at this now, though, is that this work is much more tightly coupled to a bigger refactor I have planned than I thought.

When I started working on this, I didn't have any clear answers on what tests would need to be run and how they would relate to each other, so I had to make some assumptions and move forward based on those. Now that I've finally been given a list of timeseries tests to replicate and some prospective pipelines, a bunch of cases that invalidate my assumptions have come up:

  1. The distinction between "fresh data" and periodic does not cleanly map onto timeseries vs spatial. While I still think it's true that spatial tests can't be done on fresh data, some timeseries tests also cannot: namely, "dip_check" requires an observation from the future, and min_gt (a check between related climate params) often requires derivative timeseries, which cannot be produced until later.
  2. Somewhat related to (1), I was under the impression that timeseries tests operate on a single timeseries. This is untrue, as tests like min_gt require multiple timeseries.
  3. I was under the impression that tests that depend on other tests would simply accept the flag information from previous tests (though even this currently isn't kept around as state) and handle it themselves, but it now seems that we need to do some active filtering on the rove side to remove flagged observations from the data set before putting it into tests (see the sketch after this list).
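
For (3), the filtering could look something like this minimal sketch; Obs and Flag are placeholder types, not rove's actual structures:

    #[derive(Clone, Copy, PartialEq)]
    enum Flag {
        Pass,
        Fail,
    }

    struct Obs {
        value: f32,
        flag: Flag,
    }

    // Drop observations that an earlier test in the DAG flagged as bad,
    // before handing the data set to a dependent test.
    fn filter_flagged(data: Vec<Obs>) -> Vec<Obs> {
        data.into_iter().filter(|obs| obs.flag == Flag::Pass).collect()
    }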

The ultimate end goal of the big refactor (I think) is to end up with API endpoints validate_fresh and validate_periodic instead of validate_series and validate_spatial, but it's going to be a lot of work to get there, so I suggest we break it down a bit:

Firstly, the "fresh data" case seems like a simplified special case of the periodic one, so I suggest we merge validate_spatial and validate_series into validate_periodic first, and tackle validate_fresh later.

The reason the API endpoints are separate to begin with is the different shape of spatial vs series data (see the SeriesCache and SpatialCache structs), and the different shape of the specifiers required to get them from a data source (i.e. in the case of frost you need station_id, element_id, start_time, and end_time to get series data, but element_id, timestamp, and polygon to get spatial data).
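
Roughly, the two specifier shapes look like this (field types are simplified placeholders; frost's real identifiers may differ):

    struct SeriesSpec {
        station_id: u32,
        element_id: String,
        start_time: i64, // unix timestamps, for simplicity
        end_time: i64,
    }

    struct SpatialSpec {
        element_id: String,
        timestamp: i64,
        polygon: Vec<(f64, f64)>, // lat/lon vertices
    }

    // A unified endpoint would presumably take something like:
    enum DataSpec {
        Series(SeriesSpec),
        Spatial(SpatialSpec),
    }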

Given that, I propose these steps:

  1. We try to unify/reconcile SeriesCache and SpatialCache into a single data structure that can be indexed appropriately for both kinds of test (see the sketch after this list). The hard part I foresee here is handling cases where timeseries have different time resolutions (e.g. 5-minute data and 10-minute data). I think this makes sense to tackle first, since it has no dependencies and it's the most likely to go wrong and require a rethink.
  2. Once/if we've managed that, we change the DataConnector trait and its implementors to have only the one method.
  3. We merge validate_spatial and validate_series, and change the proto file to reflect this.
  4. Tackle filtering out flagged data from tests.
  5. Tackle validate_fresh.
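
As a starting point for (1) and (2), here is a speculative sketch of a unified cache that can be indexed both along time (series-style) and across stations (spatial-style), plus a DataConnector trait reduced to a single method. Everything here is hypothetical; the real SeriesCache/SpatialCache look different, and this only sidesteps the mixed-time-resolution problem by letting each series carry its own timestamps:

    use std::collections::BTreeMap;

    type Timeseries = BTreeMap<i64, f32>; // timestamp -> value

    struct DataCache {
        // One series per (station, element) pair. Differing time resolutions
        // are tolerated because each series carries its own timestamps.
        series: BTreeMap<(u32, String), Timeseries>,
    }

    impl DataCache {
        // Series-style access: one station's values over a time window.
        fn window(&self, key: &(u32, String), from: i64, to: i64) -> Vec<f32> {
            self.series
                .get(key)
                .map(|ts| ts.range(from..=to).map(|(_, v)| *v).collect())
                .unwrap_or_default()
        }

        // Spatial-style access: every station's value at one timestamp.
        fn slice(&self, element_id: &str, time: i64) -> Vec<(u32, f32)> {
            self.series
                .iter()
                .filter(|((_, elem), _)| elem == element_id)
                .filter_map(|((station, _), ts)| ts.get(&time).map(|v| (*station, *v)))
                .collect()
        }
    }

    // The unified DataSpec from the earlier sketch (variants elided here).
    enum DataSpec {}

    // Step (2): the DataConnector trait collapses to one method.
    trait DataConnector {
        fn fetch_data(&self, spec: DataSpec) -> Result<DataCache, String>;
    }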

We can discuss this further on video tomorrow.

@Lun4m mentioned this pull request May 6, 2024
@intarga closed this May 6, 2024