Record intentional omissions from collections #1507
Comments
I can't seem to find a GitHub issue for this, but there has been talk in the past of supporting "known to be absent measurements" when defining dataset documents. This happens with radar data sources, for example.
If we had that, then we could solve the provenance tracking problem by simply adding a null dataset for every derived "per-scene" product that decides to skip a given input scene. The reason for skipping can then be recorded in the derived dataset. In fact, you can implement that without any code changes: you just need to index "missing datasets" with measurement URLs pointing to an image containing only nodata.
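As a rough illustration of that workaround, here is a minimal sketch of such a "missing dataset" document built as a Python dict. The layout is only loosely EO3-shaped, and the product name, nodata-only image path, and skip-reason property are all hypothetical, not an ODC convention:

```python
import uuid

def make_null_dataset_doc(source_scene_id: str, reason: str) -> dict:
    """Sketch a dataset document whose measurements all point at an image
    containing only nodata, recording why the input scene was skipped."""
    return {
        "id": str(uuid.uuid4()),
        "product": {"name": "my_derived_product"},   # hypothetical product
        "properties": {
            "odc:skip_reason": reason,               # hypothetical property
        },
        "measurements": {
            # Every band points at the same all-nodata image.
            "water": {"path": "empty_nodata.tif"},   # hypothetical file
        },
        "lineage": {
            # The input scene that was deliberately skipped.
            "source_scenes": [source_scene_id],
        },
    }

doc = make_null_dataset_doc("<input-scene-uuid>", "excessive radar noise")
```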
This is a curly one. A minimal short-term implementation would be @Kirill888's suggestion above.
The most viable longer-term solution is probably to allow loading data from items with missing bands. If we're lucky, this might fall out of the multidimensional loading work Kirill is about to embark on. :)
Being able to mark a band as absent for a given dataset, as opposed to pointing to an image with nodata, would be the cleaner mechanism.
Sometimes a dataset is deliberately omitted from a collection (due to noise, glitches, or some problem too peculiar to have been resolved automatically upstream). For example, a handful of scenes in DEA were identified as too faulty for WOfS. This information needs to be tracked in the ODC index (and represented by a filesystem artefact in the collection, from which the index record can be recreated) in order to distinguish accidentally missing datasets from deliberately omitted ones (i.e., so that the former, but not the latter, can be automatically back-processed).
Status quo:
Exclusion of datasets from a collection is currently ad hoc, which makes the collection difficult to curate. It is not possible to fully automate the detection, reporting, and infill of gaps (such as where an ARD dataset exists but the expected corresponding dataset is missing from a derivative product), because there is no standard mechanism to distinguish deliberate exclusions (where reprocessing would reintroduce known problems for downstream users) from datasets that were skipped accidentally and should be reattempted.
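To make the curation gap concrete, here is a rough sketch of the kind of gap report one can write today. The product names are DEA-style examples, and it assumes both products record an `odc:region_code` property; note that every gap it finds looks identical, whether accidental or deliberate:

```python
import datacube

dc = datacube.Datacube()

def scene_keys(product: str) -> set:
    """Key each dataset by (region, date) so two products can be compared."""
    return {
        (ds.metadata_doc["properties"]["odc:region_code"], ds.center_time.date())
        for ds in dc.find_datasets(product=product)
    }

ard = scene_keys("ga_ls8c_ard_3")    # upstream ARD (example name)
wofs = scene_keys("ga_ls_wo_3")      # derivative product (example name)

# With no record of deliberate omissions, every gap would be queued for
# reprocessing, including scenes that were excluded on purpose.
for region, date in sorted(ard - wofs):
    print(f"missing derivative for {region} on {date}")
```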
Proposal A:
There should be a kind of dummy null dataset, which behaves like a dataset that is not archived and has a typical spatial footprint in the ODC index, but whose valid-data extent has zero area. In other words, the metadata records for null datasets should be returned by `find_datasets` API queries, but the corresponding layers should be filtered out at the data-load stage, so that they are not represented in the raster xarray object. The intent here is that the specific dummy dataset UUID should be incorporated into the lineages of derivative products (e.g., by statistician) without being visible to the user. This lineage metadata enables a positive explanation to be reconstructed for why a potential data layer was not incorporated into an analysis, enhancing provenance.
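For concreteness, here is how proposal A could look at load time. This is a sketch only: `is_null_dataset` is a hypothetical helper (no such predicate exists in the datacube API today), and the product name is illustrative.

```python
import datacube
from shapely.geometry import shape

dc = datacube.Datacube()

def is_null_dataset(ds) -> bool:
    """Treat a dataset whose valid-data extent has zero area as a null dataset."""
    geom = ds.metadata_doc.get("geometry")          # eo3 valid-data polygon
    return geom is None or shape(geom).area == 0.0

# find_datasets returns null datasets alongside real ones...
datasets = dc.find_datasets(product="ga_ls_wo_3")   # example product name

# ...but only real datasets contribute pixels to the raster, while the
# lineage of any derivative product still records every UUID.
real = [ds for ds in datasets if not is_null_dataset(ds)]
lineage_ids = [ds.id for ds in datasets]

data = dc.load(datasets=real, output_crs="EPSG:3577", resolution=(-30, 30))
```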
Proposal B:
Alternatively, there could be a new metadata field (such as `archived: reason: foo`, to appear in STAC documents etc.), together with a new database constraint: if this field exists in the record's metadata document, then the record's archived date can never be set to null.
The downside of B is that it conflates the archival mechanism for two unrelated functions: marking datasets that would be too problematic to include in the collection, and marking records that are no longer current. (For example, with process improvements it may become possible to generate a usable dataset where the previous version was recorded as unusable. But if the marker record was in the archived state from the outset, the index has no way to track the relationship between the two versions, nor the fact that the unsuitability marker is no longer relevant. This would complicate automatic curation processes. The same circumstance would be trivial to handle under proposal A, by archiving the dummy dataset, without losing the history.)
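The constraint in proposal B could be enforced as an invariant along these lines. This is a sketch: the `archived: reason:` field and this checker are illustrative, not part of any existing ODC schema or database:

```python
from datetime import datetime
from typing import Optional

def check_archive_invariant(metadata_doc: dict, archived_at: Optional[datetime]) -> None:
    """If the metadata document carries an 'archived: reason: ...' marker,
    the record's archived timestamp must never be null."""
    reason = metadata_doc.get("archived", {}).get("reason")
    if reason is not None and archived_at is None:
        raise ValueError(
            f"record is marked unusable ({reason!r}) but not archived; "
            "un-archiving such a record is forbidden"
        )
```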