Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of top level masking functions for bitmask and enumerated… #187

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

alexgleith
Copy link
Contributor

@alexgleith alexgleith commented Nov 20, 2024

A work in progress, but currently includes functions:

  • odc.geo.masking.bits_to_bool, which makes a mask using bit flags
  • odc.geo.masking.enum_to_bool, which makes a mask using isin lookups
  • scale_and_offset, which applies offset and scale values.

Also:

  • odc.geo.masking.mask_clouds and the two functions below
  • odc.geo.masking.mask_ls: apply an opinionated mask to landsat c2l2sr data
  • odc.geo.masking.mask_s2: apply an opinionated mask to sentinel-2 c1l2 data.

I've added odc xarray extensions for them, which I think works. Tested briefly.

So, thoughts? Opinions? Fears and doubts?

@alexgleith alexgleith marked this pull request as draft November 20, 2024 04:26
@alexgleith alexgleith force-pushed the add-masking-functions branch 2 times, most recently from 2ecd02a to 479dc32 Compare November 20, 2024 04:40
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 95.90164% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.45%. Comparing base (a00c882) to head (39c42cb).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
odc/geo/masking.py 95.53% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop     #187    +/-   ##
=========================================
  Coverage    95.44%   95.45%            
=========================================
  Files           31       32     +1     
  Lines         5534     5656   +122     
=========================================
+ Hits          5282     5399   +117     
- Misses         252      257     +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link

github-actions bot commented Nov 20, 2024

@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 04:42 Inactive
@Kirill888
Copy link
Member

@alexgleith I think that chosen approach for representing "bit query" in bits_to_bool is confusing and open to interpretations, in particular the "value part", say you need to check that bit 5 is 1, is value 1<<5 or just 1.

I'd prefer if we just took a string like this

mask = bits_to_bool(xx, "xx01_1xxx")

instead of proposed

mask = bits_to_bool(xx, [3,4,5], b0001_1000)

@Kirill888
Copy link
Member

Other option I guess is a dictionary int -> 0|1, so something like

mask = bits_to_bool(xx, {5:0, 4:1, 3:1})

but that's way more prone to errors: off by 1 or LSB/MSB confusion.

odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
odc/geo/masking.py Outdated Show resolved Hide resolved
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 07:47 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:05 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:09 Inactive
@alexgleith alexgleith force-pushed the add-masking-functions branch from 3a4eb8c to 504bbb8 Compare November 20, 2024 08:10
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:12 Inactive
@alexgleith alexgleith force-pushed the add-masking-functions branch from 504bbb8 to f4f5a5b Compare November 20, 2024 08:13
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:16 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:22 Inactive
@alexgleith alexgleith force-pushed the add-masking-functions branch from 7cfd466 to 39c42cb Compare November 20, 2024 08:23
@github-actions github-actions bot temporarily deployed to pull request November 20, 2024 08:25 Inactive
@alexgleith
Copy link
Contributor Author

Would be great to get your review @robbibt and perhaps an opinionated DEA Fmask implementation mask_dea or mask_fmask?


return mask_clouds(
xx=xx,
qa_name=qa_name,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there is a default qa_name, I'm concerned this might fail without clear information to the user if the data they've loaded does not have the scl band, and they haven't realised. Should there be checks in the mask_s2 and mask_ls functions that check whether the supplied qa_name band is actually available?

SNOW = 5
CLEAR = 6
WATER = 7
# Not sure how to implement these yet...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've developed a way to do this in eo-insights: https://github.com/frontiersi/eo-insights/blob/412cfb5e4b416c1cfc7d52286a8b9eb241e730d2/src/eo_insights/masking.py#L267-L268

A two-bit pair can have four values: 0 = 0b00, 1 = 0b01, 2 = 0b10, 3 = 0b11. You can appropriately represent this by inserting the requested integer value at the minimum bit using integer_value << min(bits) such that a value of 2 at bits [2, 3] for a 4-bit integer produces 0b1000 and a value of 3 at bits [2, 3] for a 4-bit integer produces 0b1100.

Comment on lines +317 to +319
"""
Perform cloud masking for Sentinel-2 L2A products.
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything to prevent the user from mistakenly using these functions on inappropriate datasets? What happens if the user has loaded Harmonised Landsat Sentinel-2, which uses fmask, which has a different set of flag definitions (and nodata value) to either of the defaults included here? They would need to know that they can't use either of these functions, and have to use the mask_clouds function with custom variables relevant to their product. I'm worried that the inclusion of opinionated masking here is going to trip people up, and there need to be appropriate checks to inform the user that they might be using the function inappropriately.


# "Scales" and "offsets" is used by GDAL.
if scale is None:
scale = xx.attrs.get("scales")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if these attributes aren't available? Should you return None?

@caitlinadams
Copy link

@alexgleith -- is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2 products you have opinionated masking for, either from a single tiff, or a STAC query? I'd like to run a few tests to understand what might happen if a user applies the opinionated masking functions naively and makes a mistake. You could just send this to me directly if you don't want to have it as part of the repo.

@alexgleith
Copy link
Contributor Author

is there any chance you could provide a scratch script or notebook that loads the Landsat and Sentinel-2

Here's a notebook: https://gist.github.com/alexgleith/84ebc85e17b904a48319845e42947352

@caitlinadams
Copy link

Thanks for putting this together, Alex! At this stage, my main concern is that it would be very easy for the user to misuse mask_ls and mask_s2. Given this, my current preference would be to implement mask_clouds as part of odc-geo, and move the opinionated elements to a different form. This could be a repo with importable configurations, or providing examples to users about what values they need to put into mask_clouds for different products, or something else we haven't thought of yet!

What do you think?

Copy link
Contributor

@robbibt robbibt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General feedback:

This is brilliant, thanks so much for working on this @alexgleith - getting masking functionality into odc-geo is something I've wanted for a long time, and I can see this making many of our current workflows so much easier.

I do think I agree with Caitlin that that we should probably break this PR into two chunks to consider seperately. The generic tooling is "easier" to progress, and is a perfect fit for this repo. However, the opinionated masking is trickier, especially given all the possible edge cases that can occur when users naively start applying the functions to various datasets (e.g. USGS Level 1 Landsat, USGS Level 2, DEA Collection 3, DE Africa, E84, HLS etc). So perhaps, just include the following functions in this initial PR:

  • bits_to_bool: generic bitmasking
  • enum_to_bool: generic "is-in" masking
  • scale_and_offset: generic re-scaling of data by scale/offsets
  • mask_invalid_data: generic conversion of integer nodata values to NaN (given both this and scale_and_offset convert integer data to floating point, I wonder if mask_invalid_data should become just a special variant of this function and call scale_and_offset under-the-hood?)
  • mask_clouds: generic framework/workflow for masking clouds from a dataset (e.g. qa band selection, re-scaling, mask creation, mask application

For mask_clouds in particular, I think it would be really nice if rather than passing in a custom function, a user could simply provide the bit flags or the enumerated values directly to the function, e.g. perhaps something like this:

# Enum masking
ds.odc.mask_clouds(qa_name="fmask", qa_enum=[0, 4], scale=0.0001, offset=0.0)

# Bit masking
ds.odc.mask_clouds(qa_name="pixel_qa", qa_bits=[4, 3], scale=0.0000275, offset=-0.2)

That would let mask_clouds be used for generic cloud masking tasks and for future satellite datasets with different sets of masking bands and flags than we have seen so far.

Specific feedback:

  1. .odc.mask_invalid_data is great - although the first time I ran it, I was suprised that it operated "in-place", e.g. modifying the existing array without having to assign it to a new output. Is this intended? Both approaches could be useful - I guess we could include an inplace=False param in there to allow users to decide how they want to apply it.

    ds.odc.mask_invalid_data()
    # vs
    ds_masked = ds.odc.mask_invalid_data()
    
  2. .odc.enum_to_bool works a treat! Again, this one returns a new output instead of modifying data in-place, so we should probably standardise to make sure functionality is expected.

  3. I've tested odc.bits_to_bool on our WO data, and it works great for masking by individual flags, e.g. ds.water.isel(time=0).odc.bits_to_bool(bits=[7]) - I actually quite like being able to pass in a simple list of bits. However, I'm not sure how to use it for more complex things like this (which I think relates to @caitlinadams' comments above):

    image

    Bitmasking is something I've found super confusing forever, and I know it's something our users struggle with a lot too. The datacube.utils.make_mask function is probably still the most intuitive implementation I've seen, where users can do something like this using human-readable flags. I wonder if we could extent this function to allow something similar here, if variable flags exist in the data, or if a user manually provides them?

    make_mask(ds.water, water_observed=True)
    
  4. I tried running ds.odc.scale_and_offset(scale=0.0001) on DEA's ARD scaled between 0-10000. It did work, but it also operated in-place which was unexpected, and it appeared to restore nodata values back into the dataset even after converting to float (so the data had values from -999.0 to 1.0. I think it's probably a safe assumption that nodata values can be set to NaN after rescaling to floating point - I don't think having an integer nodata value is useful at that point.

def bits_to_bool(
xx: DataArray,
bits: Sequence[int] | None = None,
bitflags: int | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that bits and bitflags are mutually exclusive, e.g. users could provide either option? If so, we should clarify this in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so if you and Kirill both find this confusing, perhaps it's problematic.

Any suggestions on something better? Improving docs is simple enough... but maybe there's a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, not super confusing, just need to be clear in the docs that a user doesn't need to provide both I think (and most users except the most advanced ones will probably ignore bitflags completely).

) -> DataArray | Dataset:
"""
Apply scale and offset to the DataArray. Leave scale and offset blank to use
the values from the DataArray's attrs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these attributes that come with a particular STAC extension? If so, makes me wonder if we should configure these in our DEA datacube products too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's fairly standard. I'm not sure of the odc-stac library sets then, but it should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rioxarray does put them in (it takes those from TIFF tags). Unlike rioxarray, odc-stac loads from multiple sources that, at least in theory, can have different scale and offset for each asset being loaded, also that information can come from either STAC or from TIFF/netcdf.

There is also this possible confusion of "have these parameters been applied already or not?".

@robbibt Scale and offset are part of stac raster extension, see

https://odc-stac.readthedocs.io/en/latest/stac-best-practice.html

in STAC land it means something like "pixels are stored in a quantized form as an integer, to get back to floating point use these scale and offset parameters for all values that are not NODATA, for NODATA use NaN".

:param scale: Scale value for the dataset
:param offset: Offset value for the dataset
:param clip: Clip range for the data
:param includ_cirrus: Whether to include cirrus in the mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring here doesn't match the current params

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, on it.

xx: DataArray | Dataset,
scale: float | None = None,
offset: float | None = None,
clip: Annotated[Sequence[int | float], 2] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this can be applied at the dataset level, we should add a "skip_bands" param here to match

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@robbibt
Copy link
Contributor

robbibt commented Nov 22, 2024

I guess one other thing to consider here is that there's a bunch of similar masking tools in odc.algo which I believe were written to optimise Dask performance: https://github.com/opendatacube/odc-algo/blob/main/odc/algo/_masking.py

We should see if any of that content can be re-used here too.

@alexgleith
Copy link
Contributor Author

After chatting with @caitlinadams today, I think I'll suggest that we make the following changes:

  • Remove the mask_ls and mask_s2 functions, as they are oversimplified and dangerous
  • We can perhaps replace them with dictionaries, which provide opinionated (and editable) config that is passed into the more generic mask_clouds function, so that way it can still be easy, like ds.mask_clouds(**USGS_LS_L2SR) perhaps.

And a range of code cleanup and doc fixes, plus examples are required.

@robbibt
Copy link
Contributor

robbibt commented Nov 27, 2024

After chatting with @caitlinadams today, I think I'll suggest that we make the following changes:

  • Remove the mask_ls and mask_s2 functions, as they are oversimplified and dangerous
  • We can perhaps replace them with dictionaries, which provide opinionated (and editable) config that is passed into the more generic mask_clouds function, so that way it can still be easy, like ds.mask_clouds(**USGS_LS_L2SR) perhaps.

And a range of code cleanup and doc fixes, plus examples are required.

I really like the idea of being able to pass in custom dictionary kwarg configs - I think that would allow us to have a really nice generic function that can be highly customisable and support a wide range of current and future sensors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants