Ranges? #33

Open
m-mohr opened this issue Jun 30, 2022 · 30 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@m-mohr
Contributor

m-mohr commented Jun 30, 2022

It comes up over and over again, the range values. Recently in #31. A common example seems to be something like:

  • "categorical" no-data values (e.g. -1, -3)
  • One range with continuous data (e.g. >= 0)

Should we cater for this? I think the simplest solution would be to allow value to be an array of two values, one of which can be null (for an open-ended range), as is also defined for the STAC Collection extents.

Then you could have something like:

{
...
          "unit": "mm",
          "classification:classes": [
            {
              "value": -1,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "value": -3,
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            },
            {
              "value": [0, null],
              "name": "data",
              "description": "Actual data values in mm"
            }
          ],
...
}
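
As a rough illustration of the array proposal above, a client could match a pixel value against such classes as sketched below. The helper names (matches, class_name) are hypothetical, and inclusive bounds are an assumption; the proposal itself only says that one side may be null for an open-ended range.

```python
# Classes from the example above; JSON null becomes None in Python.
classes = [
    {"value": -1, "name": "missing-value", "nodata": True},
    {"value": -3, "name": "no-coverage", "nodata": True},
    {"value": [0, None], "name": "data"},
]


def matches(class_value, pixel):
    """Return True if pixel falls into the class described by class_value."""
    if isinstance(class_value, (list, tuple)):
        # Assumed semantics: [lower, upper], both inclusive, None = unbounded.
        lower, upper = class_value
        return (lower is None or pixel >= lower) and (upper is None or pixel <= upper)
    # A scalar is a single categorical value.
    return pixel == class_value


def class_name(pixel):
    """Name of the first matching class, or None if no class matches."""
    for cls in classes:
        if matches(cls["value"], pixel):
            return cls["name"]
    return None
```

With these assumptions, -1 maps to "missing-value", any value >= 0 maps to "data", and a value such as -2 matches no class.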
@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

One issue that might occur is defining > 0: with inclusive bounds you'd need to do something like [0.000000000000000000000000000000000000000001, null]

So an alternative would be to allow a minimal subset of JSON Schema (minimum, maximum, exclusiveMinimum, exclusiveMaximum) and allow an object instead of an array, e.g. for > 0:

{
  "exclusiveMinimum": 0
}
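
A minimal sketch of how such a range object could be evaluated, assuming the four keywords keep their JSON Schema meaning (draft-6 and later, where the exclusive keywords carry numbers). The function name in_range is hypothetical.

```python
def in_range(value, spec):
    """Check value against a range object using a subset of JSON Schema keywords."""
    if "minimum" in spec and value < spec["minimum"]:
        return False
    if "exclusiveMinimum" in spec and value <= spec["exclusiveMinimum"]:
        return False
    if "maximum" in spec and value > spec["maximum"]:
        return False
    if "exclusiveMaximum" in spec and value >= spec["exclusiveMaximum"]:
        return False
    return True
```

Here {"exclusiveMinimum": 0} rejects 0 but accepts any positive number, so no artificially tiny lower bound is needed.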

@drwelby
Collaborator

drwelby commented Jun 30, 2022

would it be terrible to be super explicit like

"classification:classes": [
            {
              "value": -1,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "values": [-3, 1, 7],
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            },
            {
              "range": [0, null], # or json schema object...
              "name": "data",
              "description": "Actual data values in mm"
            }
          ],

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Yes, I think it is terrible ;-) How would you decide whether [1,2] is a range from 1 to 2 or the two categorical values 1 and 2?

@drwelby
Collaborator

drwelby commented Jun 30, 2022

that's why the keys are explicit value, values, and range

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Ooooh, I didn't catch that difference. Sorry. I don't think that is necessary: it is more complicated to describe and read, and I don't see any obvious benefit.

@drwelby
Collaborator

drwelby commented Jun 30, 2022

I just don't like that [1, null] is a magic range while [1, 255] is ambiguous as a range or a list of values.

I do really like ranges that are json schema objects.

and of course I still don't like putting ranges into classes ;), but I want to at least get somewhere with the concept.

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

I did not consider lists of values in my proposal yet because you can emulate them just by having the classes multiple times, while you can't reasonably express continuous ranges. But yeah, the full-fledged solution would be:

  • integer: single categorical value
  • array of integers: multiple categorical values
  • json schema like object: continuous ranges

Not sure whether we should cater for all, the use cases I've heard about so far were only continuous ranges.
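
The three-way split listed above could be resolved on the client by dispatching on the JSON type of value. This is only a sketch of that idea; class_matches and the _BOUNDS table are hypothetical names, and the range keywords are assumed to follow JSON Schema semantics.

```python
import operator

# Range keyword -> comparison applied as check(pixel, bound).
_BOUNDS = {
    "minimum": operator.ge,
    "exclusiveMinimum": operator.gt,
    "maximum": operator.le,
    "exclusiveMaximum": operator.lt,
}


def class_matches(value_spec, pixel):
    """Interpret value_spec by type: object = range, array = list, scalar = single value."""
    if isinstance(value_spec, dict):
        # JSON-Schema-like object: every bound that is present must hold.
        return all(
            check(pixel, value_spec[key])
            for key, check in _BOUNDS.items()
            if key in value_spec
        )
    if isinstance(value_spec, (list, tuple)):
        # Array: multiple categorical values, never a range.
        return pixel in value_spec
    # Scalar: single categorical value.
    return pixel == value_spec
```

Under this reading [1, 255] is unambiguous: it names exactly the two values 1 and 255, and a range always has to be written as an object.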

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Alternatively, we could try something like the following although I'm not sure that would be valid in raster as we give "made-up" statistics / exclude no-data values from statistics. It also feels less intuitive.


      "raster:bands": [
        {
          "unit": "mm",
          "data_type": "float64",
          "statistics": {
            "minimum": 0
          },
          "classification:incomplete": true,
          "classification:classes": [
            {
              "value": -1.0,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "value": -3.0,
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            }
          ]
        }
      ],
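
One possible client-side reading of this variant, as a sketch only: values listed in classification:classes get their class name, and when classification:incomplete is true every other value is treated as ordinary data. That fallback behaviour and the interpret helper are assumptions made for illustration, not something the proposal defines.

```python
band = {
    "unit": "mm",
    "data_type": "float64",
    "statistics": {"minimum": 0},
    "classification:incomplete": True,
    "classification:classes": [
        {"value": -1.0, "name": "missing-value", "nodata": True},
        {"value": -3.0, "name": "no-coverage", "nodata": True},
    ],
}


def interpret(band, pixel):
    """Return the class name, "data" for unlisted values, or None."""
    for cls in band.get("classification:classes", []):
        if pixel == cls["value"]:
            return cls["name"]
    # Assumed meaning of the flag: the class list does not cover all values.
    if band.get("classification:incomplete", False):
        return "data"
    return None
```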

Thoughts, @emmanuelmathot ?

@m-mohr m-mohr added the question Further information is requested label Jun 30, 2022
@drwelby
Collaborator

drwelby commented Jun 30, 2022

Any opinion on preferring

  "maximum": 10.5,
  "exclusiveMaximum": true

versus

  "exclusiveMaximum": 10.5

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

I'd follow JSON Schema, as we already use it in other places; except for the outdated draft-4, it uses numbers instead of boolean flags: https://json-schema.org/understanding-json-schema/reference/numeric.html#range
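
To make the difference between the two spellings concrete, here is a small sketch that rewrites the draft-4 boolean form into the numeric form used by later drafts. The function normalize_range is hypothetical and only meant for illustration.

```python
def normalize_range(spec):
    """Convert draft-4 boolean exclusive flags into the numeric form of later drafts."""
    result = dict(spec)
    pairs = (("minimum", "exclusiveMinimum"), ("maximum", "exclusiveMaximum"))
    for bound, exclusive in pairs:
        flag = result.get(exclusive)
        if isinstance(flag, bool):
            # Draft-4 style: the flag modifies the sibling bound.
            del result[exclusive]
            if flag and bound in result:
                result[exclusive] = result.pop(bound)
    return result
```

Both {"maximum": 10.5, "exclusiveMaximum": true} and {"exclusiveMaximum": 10.5} then end up as the same object, so only the numeric form needs to be supported downstream.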

@drwelby
Collaborator

drwelby commented Jun 30, 2022

ok, didn't realize I was looking at an older draft 👍

@drwelby
Collaborator

drwelby commented Jun 30, 2022

"classification:incomplete": true, is interesting to me because saying

"range": [0, null], # or json schema object...
              "name": "data",
              "description": "Actual data values in mm"

seems redundant or out of place when the data set is describing rainfall in mm and a negative depth doesn't make sense.

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Yeah, I'm liking it more the more I think about it, but it's less flexible and covers only some use cases, I assume. Also, it doesn't seem so wrong to exclude no-data values from statistics, because they are usually just made-up values for file formats that don't support encoding them properly. I guess we only need to clarify in raster that no-data values are invalid pixel values and as such should not be reflected in statistics etc. On the other hand, statistics are usually real min/max values, while what we want to describe here are theoretical min and max values. For example, if you have a raster with precipitation values, the values could be 1, 5 and 10, so min/max are 1 and 10, although the potential range is 0 to infinity (mostly). But maybe that's not an issue?!
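
The distinction between observed and theoretical bounds can be shown with a short sketch. The function valid_min_max and the sample pixel list are made up for illustration; the no-data values -1 and -3 are the ones used earlier in this thread.

```python
def valid_min_max(pixels, nodata_values):
    """Observed (min, max) over valid pixels only, or None if none are valid."""
    valid = [p for p in pixels if p not in nodata_values]
    if not valid:
        return None
    return min(valid), max(valid)


# Hypothetical precipitation samples with two no-data markers mixed in.
pixels = [-1, 1, 5, -3, 10]
observed = valid_min_max(pixels, {-1, -3})
```

The observed result is (1, 10), while the theoretical lower bound of 0 cannot be derived from the pixels at all and would have to be stated separately.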

@drwelby
Collaborator

drwelby commented Jun 30, 2022

are we really saying that this is a continuous dataset with classed nodata and should have something roughly like:

"nodata": {
   "classification:classes": {
        ... classes
}

with something else that clarifies that the range of possible data values does not include the full range of the datatype?

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Hmm, then I still don't have a way to express no-data values and their meanings in STAC. In file it was removed, in raster it got somewhat rejected. I really just want to express -1 is missing value, -3 is no coverage for example. And it seems it would fit in here.

Sorry, misunderstood you initially. But still not sure, I think I like the proposal above more, because it just adds an additional field here instead of adding a new data type to an existing field. #33 (comment)

@drwelby
Collaborator

drwelby commented Jun 30, 2022

yes, the question is more "is classification a good enough home for nodata" versus "nodata can be messy enough to warrant some kind of new extension that can use classification if needed" and I understand not wanting to start another extension...

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

Well, nodata is already part of raster so would be a change in that extension. But I don't like putting classification:classes into so many different places. Also, if you have no-data values and categorical values in a file, do you really want to have them in two different places?

@drwelby
Collaborator

drwelby commented Jun 30, 2022

classification: ¯\_(ツ)_/¯: true

@drwelby
Collaborator

drwelby commented Jun 30, 2022

The more I think about it, saying "this dataset uses classes but isn't classified" seems reasonable and simple.

@m-mohr
Contributor Author

m-mohr commented Jun 30, 2022

I created PR #34 to discuss a potential solution more closely.

m-mohr added a commit that referenced this issue Jun 30, 2022
@m-mohr m-mohr linked a pull request Jun 30, 2022 that will close this issue
@emmanuelmathot
Member

(Quoting m-mohr's comment above, which proposed "statistics": {"minimum": 0} together with "classification:incomplete": true in raster:bands and asked for thoughts.)

The statistics field represents stats about the distribution of ALL pixels in the band ¯\_(ツ)_/¯ but using it for stats of only VALID PIXELS, and thus to define boundaries, is not strictly forbidden :-). For instance, we use that information to help users select the possible range. In this case, this could be interesting.

image

@pjhartzell
Contributor

(Quoting m-mohr's "full-fledged solution" above: an integer for a single categorical value, an array of integers for multiple categorical values, and a JSON-Schema-like object for continuous ranges.)

I like the "full-fledged solution". However, even if the array of integers doesn't make it in, I prefer the json schema like object for continuous ranges for its clarity; it also leaves the door open to adding arrays of integers without having to change how continuous ranges are expressed.

Mocking up classification:classes for a VIIRS vegetation index band:

{
    ...
        "scale": 0.0001,
        "data_type": "int16",
        "classification:classes": [
            {
                "value": -13000,
                "name": "fill_land",
                "description": "Fill value over land",
                "nodata": true
            },
            {
                "value": -15000,
                "name": "fill_water",
                "description": "Fill value over ocean or fresh water",
                "nodata": true
            },
            {
                "value": {
                    "minimum": -10000,
                    "maximum": 10000
                },
                "name": "data",
                "description": "Valid range of vegetation index values"
            }
        ],
    ...
}

Perhaps not necessary, but it is nice to be able to describe the valid range of vegetation index data (a defined subset of the possible int16 values).
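
A sketch of how a client might use this mock-up: raw int16 values are first matched against the classes, and only values in the valid range are scaled. The describe helper is hypothetical, and the code assumes an offset of 0 since the mock-up only gives a scale.

```python
band = {
    "scale": 0.0001,
    "data_type": "int16",
    "classification:classes": [
        {"value": -13000, "name": "fill_land", "nodata": True},
        {"value": -15000, "name": "fill_water", "nodata": True},
        {"value": {"minimum": -10000, "maximum": 10000}, "name": "data"},
    ],
}


def describe(raw):
    """Return (class name, scaled value); the scaled value is None for fill classes."""
    for cls in band["classification:classes"]:
        value = cls["value"]
        if isinstance(value, dict):
            # Range object with inclusive minimum and maximum.
            if value["minimum"] <= raw <= value["maximum"]:
                return cls["name"], raw * band["scale"]
        elif raw == value:
            return cls["name"], None
    # Raw value is neither a fill value nor inside the valid range.
    return None, None
```

With a scale of 0.0001, the valid raw range of -10000 to 10000 corresponds to index values of roughly -1 to 1, and a raw value such as 12000 falls outside every class.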

@drwelby
Collaborator

drwelby commented Jul 1, 2022

To me describing the valid range of a continuous dataset has nothing to do with classification. I'm not sure how a client can or should deal with that class when it isn't a class at all.

@pjhartzell
Contributor

pjhartzell commented Jul 1, 2022

@drwelby I see your point, I think. I suppose the same argument could be made for any continuous range? Or is it particular to the valid range?

@drwelby
Collaborator

drwelby commented Jul 1, 2022

To me the valid range is akin to raster:bits_per_sample and should live there.

@pjhartzell
Contributor

Yep, I see the connection to bits_per_sample. In this case, the range doesn't fit cleanly into a set number of bits. But I get your point about it not being a class. I'm not concerned about including this information, so we don't need to take this any further. On the face of it, it seemed like it would make sense to describe the data range since the no-data values are also being described. But if there is no value on the client end, then no point. 🙂

@m-mohr
Contributor Author

m-mohr commented Jul 18, 2022

From the STAC call: No one screamed at me when I said "ranges" are not categories. ;-)

I think we can leave this open for further feedback, but I won't push for a change here. If you only want to describe a single class of valid values (e.g. >= 0), then consider using the statistics or histogram in raster:bands.

@m-mohr m-mohr added the help wanted Extra attention is needed label Jul 18, 2022
@m-mohr m-mohr mentioned this issue Dec 13, 2022
@pjhartzell
Contributor

Here's an example where allowing a range for the Class object value could have been useful:

image

The cover change values are interpreted as <from class><to class>, e.g., a value of 12 indicates a change from class 1 to class 2. So they could all be mapped to unique categories. But that seems overkill.

@m-mohr
Contributor Author

m-mohr commented Dec 13, 2022

@pjhartzell How would you want to expose that exactly? 12-21, 23-32, 34-43, ...? or just 12-87?

@pjhartzell
Contributor

For this case, [12-21, 23-32, 34-43] would be ideal. [12-87] would be a fallback if multiple ranges can't be expressed.
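
For illustration, the <from class><to class> encoding and the multi-range idea could look like this in code. Both helpers are hypothetical; the decoding assumes single-digit class numbers, and the list of (lower, upper) tuples is just one conceivable way to express several inclusive ranges.

```python
def decode_change(value):
    """Split a two-digit change code into (from_class, to_class)."""
    return divmod(value, 10)


# The three ranges mentioned above, as inclusive (lower, upper) pairs.
change_ranges = [(12, 21), (23, 32), (34, 43)]


def in_any_range(value, ranges):
    """True if value lies inside at least one inclusive range."""
    return any(lower <= value <= upper for lower, upper in ranges)
```

A value of 12 decodes to a change from class 1 to class 2. The single fallback range 12-87 is looser: it also admits values such as 22 that the three narrower ranges exclude.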

4 participants